feat: cute dsl mmfp4 for blackwell #2540

Merged
nv-yunzheq merged 9 commits into flashinfer-ai:main from nv-yunzheq:cute_dsl_mmfp4 on Feb 21, 2026

Conversation

@nv-yunzheq
Collaborator

@nv-yunzheq nv-yunzheq commented Feb 11, 2026

📌 Description

Issue #2466

This PR integrates cute_dsl as a new backend for mm_fp4.
dense_blockscaled_gemm_sm100.py is derived from dense_blockscaled_gemm_persistent.py in TensorRT-LLM.
dense_blockscaled_gemm_sm103.py is derived from sm103_dense_blockscaled_gemm_persistent.py in CUTLASS. This file is integrated but not currently used, since it requires a pre-release version of nvidia-cutlass-dsl.
gemm_base.py contains the main wrapper logic for the mm_fp4 cute DSL GEMM kernel.
The mm_fp4 unit test and benchmark script are also updated to cover the cute_dsl backend.
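
A minimal usage sketch of selecting the new backend; only backend="cute-dsl" is the point here, and the import path and positional arguments are illustrative rather than the authoritative mm_fp4 signature (see flashinfer/gemm/gemm_base.py):

import torch
from flashinfer import mm_fp4  # import path per the public API; adjust if needed

def run_cute_dsl_mm_fp4(a_fp4, b_fp4, a_scale, b_scale, alpha):
    # a_fp4/b_fp4: packed FP4 operands; a_scale/b_scale: block scale factors;
    # alpha: global dequantization scale. Names are illustrative placeholders.
    return mm_fp4(
        a_fp4, b_fp4, a_scale, b_scale, alpha, torch.bfloat16, backend="cute-dsl"
    )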

The performance data:

MMFP4 Benchmark Results

GB200 Non-Autotune (times in ms; throughput in TFLOPS and TB/s)

m n k cudnn_time cudnn_tflops cudnn_tb_per_sec cutlass_time cutlass_tflops cutlass_tb_per_sec trtllm_time trtllm_tflops trtllm_tb_per_sec cute_dsl_time cute_dsl_tflops cute_dsl_tb_per_sec best_backend cute_dsl_vs_best_other_speedup
1 512 7168 0.009808 0.7484 0.1876 0.009312 0.7882 0.1976 0.010912 0.6727 0.1686 0.009824 0.7472 0.1873 cutlass 0.9479
1 896 1024 0.004176 0.4394 0.1104 0.004720 0.3888 0.0977 0.004608 0.3982 0.1001 0.004384 0.4186 0.1052 cudnn 0.9526
1 896 5120 0.008192 1.1200 0.2805 0.007888 1.1632 0.2913 0.008896 1.0314 0.2583 0.008128 1.1288 0.2827 cutlass 0.9705
1 1024 7168 0.010224 1.4358 0.3595 0.009600 1.5292 0.3829 0.011216 1.3089 0.3277 0.010192 1.4404 0.3606 cutlass 0.9419
1 1280 8192 0.011312 1.8539 0.4641 0.010448 2.0072 0.5024 0.012448 1.6847 0.4217 0.011311 1.8540 0.4641 cutlass 0.9237
1 1792 5120 0.008528 2.1517 0.5387 0.008416 2.1804 0.5458 0.009328 1.9672 0.4925 0.008592 2.1357 0.5346 cutlass 0.9795
1 2560 8192 0.012160 3.4493 0.8631 0.011440 3.6663 0.9174 0.012896 3.2524 0.8138 0.011888 3.5282 0.8828 cutlass 0.9623
1 3584 5120 0.009376 3.9143 0.9796 0.009360 3.9210 0.9813 0.010048 3.6527 0.9141 0.009408 3.9010 0.9763 cutlass 0.9949
1 4608 7168 0.011776 5.6097 1.4035 0.011264 5.8647 1.4673 0.012960 5.0972 1.2753 0.012320 5.3620 1.3415 cutlass 0.9143
1 5120 640 0.004384 1.4949 0.3761 0.005024 1.3045 0.3282 0.004704 1.3932 0.3505 0.004832 1.3563 0.3413 cudnn 0.9073
1 5120 1024 0.004800 2.1845 0.5484 0.005424 1.9332 0.4853 0.005168 2.0290 0.5093 0.004992 2.1005 0.5273 cudnn 0.9615
1 5120 1280 0.005248 2.4976 0.6265 0.005696 2.3011 0.5772 0.005568 2.3540 0.5905 0.005408 2.4237 0.6079 cudnn 0.9704
1 5120 2048 0.006432 3.2605 0.8169 0.006752 3.1060 0.7782 0.006608 3.1737 0.7951 0.006528 3.2125 0.8049 cudnn 0.9853
1 5120 2560 0.007184 3.6490 0.9139 0.007488 3.5009 0.8768 0.007392 3.5463 0.8881 0.007360 3.5617 0.8920 cudnn 0.9761
1 5120 4096 0.008864 4.7318 1.1843 0.008832 4.7490 1.1886 0.009440 4.4431 1.1121 0.009072 4.6234 1.1572 cutlass 0.9735
1 5120 5120 0.009712 5.3984 1.3509 0.009664 5.4252 1.3576 0.010576 4.9573 1.2405 0.010048 5.2178 1.3057 cutlass 0.9618
1 5120 8192 0.013264 6.3243 1.5822 0.012560 6.6788 1.6708 0.014400 5.8254 1.4574 0.013696 6.1249 1.5323 cutlass 0.9171
1 5120 16384 0.020928 8.0166 2.0050 0.020480 8.1920 2.0489 0.025024 6.7045 1.6768 0.023968 6.9998 1.7507 cutlass 0.8545
1 7168 256 0.003888 0.9439 0.2397 0.004656 0.7882 0.2002 0.004080 0.8995 0.2284 0.004032 0.9102 0.2311 cudnn 0.9643
1 7168 512 0.004160 1.7644 0.4446 0.004928 1.4895 0.3753 0.004624 1.5874 0.4000 0.004512 1.6268 0.4099 cudnn 0.9220
1 7168 4608 0.009664 6.8357 1.7106 0.009840 6.7134 1.6801 0.010752 6.1443 1.5376 0.010432 6.3325 1.5847 cudnn 0.9264
1 7168 5120 0.010336 7.1014 1.7770 0.010400 7.0577 1.7661 0.011968 6.1330 1.5347 0.011792 6.2246 1.5576 cudnn 0.8765
1 8192 1024 0.005184 3.2363 0.8123 0.005936 2.8264 0.7094 0.005520 3.0394 0.7629 0.005440 3.0840 0.7741 cudnn 0.9529
1 8192 2048 0.007200 4.6603 1.1675 0.007888 4.2539 1.0657 0.007632 4.3965 1.1014 0.007392 4.5393 1.1372 cudnn 0.9740
1 8192 3584 0.008800 6.6728 1.6703 0.009360 6.2735 1.5703 0.010000 5.8720 1.4698 0.009824 5.9772 1.4962 cudnn 0.8958
1 8192 4096 0.009328 7.1943 1.8006 0.009952 6.7433 1.6877 0.010848 6.1863 1.5483 0.010896 6.1590 1.5415 cudnn 0.8561
1 8192 7168 0.013344 8.8010 2.2017 0.013584 8.6455 2.1628 0.016608 7.0713 1.7690 0.016480 7.1262 1.7828 cudnn 0.8097
1 8192 8192 0.014624 9.1779 2.2959 0.014400 9.3207 2.3316 0.018672 7.1882 1.7981 0.017968 7.4698 1.8686 cutlass 0.8014
1 8192 14336 0.022528 10.4262 2.6076 0.022096 10.6300 2.6586 0.029600 7.9353 1.9846 0.028496 8.2426 2.0615 cutlass 0.7754
1 8192 28672 0.040080 11.7206 2.9309 0.040080 11.7206 2.9309 0.057424 8.1806 2.0457 0.054272 8.6557 2.1645 cudnn 0.7385
1 9216 7168 0.013648 9.6806 2.4218 0.016752 7.8869 1.9730 0.019040 6.9393 1.7360 0.018496 7.1432 1.7870 cudnn 0.7379
1 10240 8192 0.016496 10.1705 2.5441 0.025280 6.6366 1.6601 0.021904 7.6594 1.9160 0.021104 7.9498 1.9886 cudnn 0.7817
4 512 7168 0.009792 2.9984 0.1893 0.009056 3.2421 0.2047 0.010944 2.6828 0.1694 0.009824 2.9886 0.1887 cutlass 0.9218
4 896 1024 0.004176 1.7577 0.1121 0.004656 1.5765 0.1005 0.004608 1.5929 0.1016 0.004368 1.6804 0.1071 cudnn 0.9560
4 1024 7168 0.010112 5.8070 0.3652 0.009376 6.2628 0.3938 0.011152 5.2654 0.3311 0.010064 5.8347 0.3669 cutlass 0.9316
4 4608 7168 0.011744 22.5001 1.4106 0.011408 23.1628 1.4522 0.012864 20.5411 1.2878 0.012320 21.4481 1.3447 cutlass 0.9260
4 7168 256 0.003808 3.8551 0.2561 0.004544 3.2306 0.2146 0.004096 3.5840 0.2381 0.004096 3.5840 0.2381 cudnn 0.9297
4 7168 512 0.004336 6.7712 0.4367 0.005024 5.8440 0.3769 0.004592 6.3938 0.4123 0.004560 6.4386 0.4152 cudnn 0.9509
4 7168 2304 0.006976 18.9393 1.1926 0.007488 17.6443 1.1110 0.007360 17.9512 1.1304 0.007232 18.2689 1.1504 cudnn 0.9646
4 7168 4608 0.009712 27.2077 1.7073 0.009824 26.8975 1.6879 0.011120 23.7627 1.4912 0.010928 24.1802 1.5174 cudnn 0.8887
4 9216 7168 0.013488 39.1817 2.4554 0.013728 38.4967 2.4125 0.018992 27.8266 1.7438 0.018032 29.3080 1.8366 cudnn 0.7480
8 896 5120 0.008128 9.0306 0.2865 0.007840 9.3623 0.2970 0.008976 8.1774 0.2594 0.008224 8.9251 0.2831 cutlass 0.9533
8 1280 8192 0.011248 14.9157 0.4709 0.010336 16.2318 0.5124 0.012416 13.5126 0.4266 0.011296 14.8524 0.4688 cutlass 0.9150
8 1792 5120 0.008608 17.0540 0.5386 0.008448 17.3770 0.5488 0.009408 15.6038 0.4928 0.008640 16.9908 0.5367 cutlass 0.9778
8 2560 8192 0.012160 27.5941 0.8684 0.011392 29.4544 0.9269 0.013056 25.7004 0.8088 0.011888 28.2255 0.8882 cutlass 0.9583
8 3584 5120 0.009360 31.3677 0.9886 0.009344 31.4214 0.9902 0.010032 29.2665 0.9223 0.009472 30.9968 0.9769 cutlass 0.9865
8 5120 640 0.004416 11.8725 0.3901 0.005072 10.3369 0.3397 0.004816 10.8864 0.3577 0.004752 11.0330 0.3626 cudnn 0.9293
8 5120 1024 0.004736 17.7124 0.5717 0.005376 15.6038 0.5036 0.005232 16.0333 0.5175 0.004992 16.8041 0.5424 cudnn 0.9487
8 5120 1280 0.005136 20.4162 0.6550 0.005664 18.5130 0.5939 0.005664 18.5130 0.5939 0.005312 19.7398 0.6333 cudnn 0.9669
8 5120 2048 0.006336 26.4792 0.8417 0.006720 24.9661 0.7936 0.006688 25.0856 0.7974 0.006544 25.6376 0.8149 cudnn 0.9682
8 5120 2560 0.007152 29.3226 0.9292 0.007488 28.0068 0.8875 0.007360 28.4939 0.9030 0.007312 28.6810 0.9089 cudnn 0.9781
8 5120 4096 0.008848 37.9232 1.1962 0.008832 37.9919 1.1984 0.009168 36.5995 1.1545 0.008976 37.3824 1.1792 cutlass 0.9840
8 5120 5120 0.009824 42.6945 1.3446 0.009680 43.3296 1.3646 0.010624 39.4795 1.2434 0.009984 42.0103 1.3231 cutlass 0.9696
8 5120 8192 0.013488 49.7545 1.5633 0.012672 52.9584 1.6640 0.014368 46.7072 1.4676 0.013648 49.1712 1.5450 cutlass 0.9285
8 5120 16384 0.021695 61.8643 1.9401 0.020752 64.6770 2.0283 0.024848 54.0155 1.6939 0.023136 58.0125 1.8193 cutlass 0.8970
8 7168 5120 0.010288 57.0765 1.7968 0.010240 57.3440 1.8052 0.011632 50.4817 1.5892 0.011312 51.9097 1.6341 cutlass 0.9052
8 8192 1024 0.005264 25.4973 0.8225 0.005984 22.4294 0.7235 0.005600 23.9675 0.7731 0.005568 24.1052 0.7776 cudnn 0.9454
8 8192 2048 0.007136 37.6171 1.1950 0.007488 35.8488 1.1389 0.007424 36.1578 1.1487 0.007392 36.3143 1.1537 cudnn 0.9654
8 8192 3584 0.008784 53.4793 1.6878 0.008960 52.4288 1.6546 0.009808 47.8958 1.5116 0.009424 49.8474 1.5732 cudnn 0.9321
8 8192 4096 0.009392 57.1626 1.8020 0.009568 56.1111 1.7689 0.010720 50.0812 1.5788 0.010496 51.1500 1.6125 cudnn 0.8948
8 8192 7168 0.013280 70.7473 2.2229 0.012960 72.4941 2.2778 0.015984 58.7790 1.8468 0.015536 60.4740 1.9001 cutlass 0.8342
8 8192 8192 0.014752 72.7862 2.2857 0.014160 75.8292 2.3812 0.017248 62.2531 1.9549 0.016704 64.2805 2.0186 cutlass 0.8477
8 8192 14336 0.022288 84.3076 2.6431 0.021824 86.1001 2.6993 0.027936 67.2626 2.1087 0.027040 69.4914 2.1786 cutlass 0.8071
8 8192 28672 0.040096 93.7275 2.9351 0.039584 94.9398 2.9731 0.053824 69.8219 2.1865 0.050464 74.4708 2.3321 cutlass 0.7844
8 10240 8192 0.016352 82.0803 2.5770 0.020640 65.0280 2.0416 0.020720 64.7769 2.0338 0.019472 68.9303 2.1642 cudnn 0.8398
16 512 7168 0.009824 11.9544 0.1943 0.009184 12.7875 0.2078 0.010960 10.7154 0.1742 0.009856 11.9156 0.1937 cutlass 0.9318
16 896 1024 0.004256 6.8985 0.1165 0.004736 6.1994 0.1046 0.004608 6.3716 0.1076 0.004352 6.7464 0.1139 cudnn 0.9779
16 1024 7168 0.010048 23.3759 0.3742 0.009408 24.9661 0.3997 0.011296 20.7933 0.3329 0.010032 23.4132 0.3748 cutlass 0.9378
16 4608 7168 0.011824 89.3915 1.4141 0.011360 93.0427 1.4718 0.012656 83.5149 1.3211 0.011936 88.5527 1.4008 cutlass 0.9517
16 7168 256 0.003936 14.9188 0.2919 0.004608 12.7431 0.2493 0.004080 14.3922 0.2816 0.004000 14.6801 0.2872 cudnn 0.9840
16 7168 512 0.004208 27.9089 0.4916 0.004896 23.9870 0.4225 0.004544 25.8452 0.4552 0.004448 26.4030 0.4650 cudnn 0.9460
16 7168 2304 0.007008 75.4113 1.2137 0.007488 70.5772 1.1359 0.007456 70.8801 1.1407 0.007392 71.4938 1.1506 cudnn 0.9481
16 7168 4608 0.009664 109.3713 1.7365 0.009808 107.7656 1.7110 0.010560 100.0913 1.5891 0.010272 102.8976 1.6337 cudnn 0.9408
16 9216 7168 0.013648 154.8893 2.4460 0.013632 155.0711 2.4488 0.016240 130.1681 2.0556 0.015680 134.8169 2.1290 cutlass 0.8694
64 512 7168 0.009872 47.5853 0.2158 0.009120 51.5090 0.2335 0.011136 42.1841 0.1913 0.009712 48.3692 0.2193 cutlass 0.9390
64 896 1024 0.004256 27.5941 0.1424 0.004672 25.1371 0.1298 0.004848 24.2245 0.1250 0.004384 26.7884 0.1383 cudnn 0.9708
64 896 5120 0.008240 71.2624 0.3122 0.007920 74.1417 0.3248 0.009120 64.3862 0.2820 0.008160 71.9611 0.3152 cutlass 0.9706
64 1280 8192 0.011200 119.8373 0.5061 0.010400 129.0555 0.5451 0.012720 105.5171 0.4457 0.011248 119.3259 0.5040 cutlass 0.9246
64 1792 5120 0.008720 134.6795 0.5712 0.008384 140.0769 0.5941 0.009600 122.3339 0.5188 0.008624 136.1787 0.5775 cutlass 0.9722
64 2560 8192 0.012224 219.5971 0.9061 0.011632 230.7733 0.9522 0.013120 204.6002 0.8442 0.012048 222.8050 0.9193 cutlass 0.9655
64 3584 5120 0.009504 247.1391 1.0309 0.009408 249.6610 1.0414 0.010112 232.2795 0.9689 0.009472 247.9741 1.0344 cutlass 0.9932
64 4608 7168 0.011872 356.1202 1.4601 0.011408 370.6047 1.5195 0.012896 327.8426 1.3442 0.011712 360.9852 1.4800 cutlass 0.9740
64 5120 640 0.004496 93.2897 0.5147 0.005088 82.4352 0.4548 0.004912 85.3889 0.4711 0.004704 89.1646 0.4920 cudnn 0.9558
64 5120 1024 0.004864 137.9705 0.6804 0.005408 124.0918 0.6120 0.005440 123.3619 0.6084 0.004992 134.4328 0.6630 cudnn 0.9744
64 5120 1280 0.005312 157.9181 0.7480 0.005792 144.8309 0.6860 0.005792 144.8309 0.6860 0.005376 156.0381 0.7390 cudnn 0.9881
64 5120 2048 0.006432 208.6718 0.9272 0.006752 198.7822 0.8833 0.006992 191.9590 0.8529 0.006592 203.6070 0.9047 cudnn 0.9757
64 5120 2560 0.007216 232.5002 1.0104 0.007520 223.1013 0.9695 0.007648 219.3674 0.9533 0.007360 227.9513 0.9906 cudnn 0.9804
64 5120 4096 0.008960 299.5931 1.2581 0.008896 301.7485 1.2671 0.009440 284.3596 1.1941 0.008960 299.5931 1.2581 cutlass 0.9929
64 5120 5120 0.009984 336.0821 1.3949 0.009872 339.8950 1.4107 0.010608 316.3125 1.3128 0.009856 340.4468 1.4130 cute_dsl 1.0016
64 5120 8192 0.013536 396.6245 1.6171 0.012768 420.4816 1.7144 0.014784 363.1432 1.4806 0.013280 404.2703 1.6483 cutlass 0.9614
64 5120 16384 0.021168 507.2476 2.0372 0.020736 517.8153 2.0796 0.024736 434.0806 1.7433 0.022336 480.7225 1.9306 cutlass 0.9284
64 7168 256 0.004032 58.2542 0.4571 0.004672 50.2742 0.3945 0.004320 54.3706 0.4267 0.004048 58.0240 0.4553 cudnn 0.9960
64 7168 512 0.004256 110.3764 0.6506 0.004960 94.7101 0.5582 0.004736 99.1896 0.5846 0.004544 103.3807 0.6094 cudnn 0.9366
64 7168 2304 0.007168 294.9120 1.2903 0.007488 282.3089 1.2351 0.007584 278.7354 1.2195 0.007312 289.1041 1.2649 cudnn 0.9803
64 7168 4608 0.009856 428.9629 1.7837 0.009952 424.8250 1.7665 0.010400 406.5248 1.6904 0.009760 433.1822 1.8012 cute_dsl 1.0098
64 7168 5120 0.010464 448.9316 1.8570 0.010336 454.4911 1.8800 0.011264 417.0473 1.7251 0.010400 451.6943 1.8684 cutlass 0.9938
64 8192 1024 0.005408 198.5469 0.9755 0.005920 181.3753 0.8912 0.005728 187.4549 0.9210 0.005520 194.5184 0.9557 cudnn 0.9797
64 8192 2048 0.007296 294.3371 1.3025 0.007632 281.3789 1.2451 0.007520 285.5696 1.2637 0.007360 291.7777 1.2911 cudnn 0.9913
64 8192 3584 0.008976 418.6828 1.7651 0.009136 411.3503 1.7342 0.009552 393.4356 1.6586 0.009280 404.9673 1.7073 cudnn 0.9672
64 8192 4096 0.009584 448.1393 1.8736 0.009680 443.6950 1.8550 0.010080 426.0880 1.7814 0.009568 448.8887 1.8768 cute_dsl 1.0017
64 8192 7168 0.013424 559.9071 2.2823 0.012912 582.1091 2.3728 0.014560 516.2220 2.1043 0.013296 565.2973 2.3043 cutlass 0.9711
64 8192 8192 0.014656 586.1036 2.3789 0.014144 607.3200 2.4650 0.016128 532.6100 2.1618 0.014592 588.6742 2.3893 cutlass 0.9693
64 8192 14336 0.022608 664.9144 2.6640 0.021920 685.7840 2.7476 0.025600 587.2026 2.3526 0.023664 635.2428 2.5451 cutlass 0.9263
64 8192 28672 0.039760 756.1562 3.0032 0.039520 760.7483 3.0214 0.047456 633.5294 2.5162 0.043760 687.0377 2.7287 cutlass 0.9031
64 9216 7168 0.013871 609.5748 2.4827 0.014208 595.1377 2.4239 0.014816 570.7152 2.3245 0.013664 618.8317 2.5204 cute_dsl 1.0152
64 10240 8192 0.016720 642.1901 2.6026 0.020848 515.0335 2.0873 0.017920 599.1863 2.4283 0.016736 641.5761 2.6001 cudnn 0.9990
256 512 7168 0.010224 183.7880 0.2949 0.009376 200.4104 0.3215 0.010848 173.2161 0.2779 0.009824 191.2712 0.3069 cutlass 0.9544
256 896 1024 0.004384 107.1538 0.2392 0.004832 97.2190 0.2170 0.004992 94.1030 0.2101 0.004352 107.9416 0.2409 cute_dsl 1.0074
256 1024 7168 0.008736 430.1850 0.5851 0.009760 385.0509 0.5238 0.011616 323.5276 0.4401 0.010208 368.1521 0.5008 cudnn 0.8558
256 4608 7168 0.010560 1601.4615 1.8742 0.011712 1443.9407 1.6899 0.013472 1255.3024 1.4691 0.012048 1403.6715 1.6428 cudnn 0.8765
256 7168 256 0.004064 231.1821 1.1369 0.004672 201.0968 0.9889 0.004544 206.7615 1.0168 0.004064 231.1821 1.1369 cudnn 1.0000
256 7168 512 0.004416 425.5091 1.2614 0.005072 370.4748 1.0983 0.004960 378.8404 1.1231 0.004464 420.9337 1.2479 cudnn 0.9892
256 7168 2304 0.006912 1223.3387 1.7683 0.007792 1085.1793 1.5686 0.007936 1065.4885 1.5401 0.007360 1148.8746 1.6607 cudnn 0.9391
256 7168 4608 0.009184 1841.4018 2.2621 0.010080 1677.7216 2.0610 0.010864 1556.6489 1.9123 0.010160 1664.5112 2.0448 cudnn 0.9039
256 9216 7168 0.012576 2689.4774 3.0746 0.013472 2510.6048 2.8701 0.016080 2103.4122 2.4046 0.014624 2312.8328 2.6440 cudnn 0.8600
512 896 5120 0.007456 630.0879 0.6065 0.008320 564.6178 0.5435 0.009248 507.9607 0.4890 0.008448 556.0630 0.5353 cudnn 0.8825
512 1280 8192 0.011728 915.5370 0.7376 0.011056 971.1847 0.7824 0.012816 837.8135 0.6750 0.011584 926.9180 0.7468 cutlass 0.9544
512 1792 5120 0.007872 1193.5011 0.9824 0.008816 1065.7034 0.8772 0.009616 977.0425 0.8042 0.008816 1065.7034 0.8772 cudnn 0.8929
512 2560 8192 0.010768 1994.3199 1.4120 0.011920 1801.5802 1.2755 0.013536 1586.4980 1.1233 0.012208 1759.0790 1.2454 cudnn 0.8820
512 3584 5120 0.008960 2097.1520 1.5799 0.009856 1906.5018 1.4363 0.010528 1784.8102 1.3446 0.009664 1944.3793 1.4648 cudnn 0.9272
512 5120 640 0.005632 595.7818 1.2509 0.006048 554.8021 1.1649 0.008352 401.7533 0.8435 0.005424 618.6289 1.2989 cute_dsl 1.0383
512 5120 1024 0.006064 885.3412 1.3401 0.006480 828.5684 1.2542 0.009040 593.8838 0.8989 0.005888 911.8052 1.3802 cute_dsl 1.0299
512 5120 1280 0.006656 1008.2462 1.3292 0.007008 957.6037 1.2625 0.009664 694.4212 0.9155 0.006880 975.4195 1.2860 cudnn 0.9674
512 5120 2048 0.008096 1326.2621 1.3599 0.008256 1300.5594 1.3336 0.011456 937.2746 0.9611 0.008848 1213.5418 1.2444 cudnn 0.9150
512 5120 2560 0.009072 1479.4723 1.3726 0.009520 1409.8501 1.3080 0.012352 1086.6073 1.0081 0.009856 1361.7870 1.2634 cudnn 0.9205
512 5120 4096 0.011424 1879.8001 1.4686 0.011856 1811.3054 1.4151 0.015616 1375.1816 1.0744 0.012880 1667.3010 1.3026 cudnn 0.8870
512 5120 5120 0.012736 2107.6905 1.5437 0.013408 2002.0544 1.4663 0.017824 1506.0338 1.1031 0.014688 1827.5834 1.3386 cudnn 0.8671
512 5120 8192 0.017072 2515.7962 1.6584 0.017744 2420.5181 1.5956 0.024592 1746.4896 1.1513 0.020608 2084.1262 1.3738 cudnn 0.8284
512 5120 16384 0.030064 2857.2161 1.7090 0.032432 2648.5985 1.5842 0.044320 1938.1621 1.1593 0.038832 2212.1048 1.3232 cudnn 0.7742
512 7168 5120 0.013472 2789.5609 2.0042 0.014703 2555.9196 1.8364 0.019504 1926.8337 1.3844 0.015920 2360.6133 1.6960 cudnn 0.8462
512 8192 1024 0.006688 1284.3802 1.9206 0.007136 1203.7464 1.8000 0.009968 861.7511 1.2886 0.007200 1193.0465 1.7840 cudnn 0.9289
512 8192 2048 0.008544 2010.7525 2.0250 0.009824 1748.7652 1.7611 0.012911 1330.5866 1.3400 0.009552 1798.5625 1.8113 cudnn 0.8945
512 8192 3584 0.010656 2821.3937 2.2510 0.012272 2449.8673 1.9545 0.016448 1827.8679 1.4583 0.012848 2340.0351 1.8669 cudnn 0.8294
512 8192 4096 0.011552 2974.3541 2.2693 0.013344 2574.9204 1.9645 0.017504 1962.9649 1.4976 0.013760 2497.0740 1.9051 cudnn 0.8395
512 8192 7168 0.015664 3838.7093 2.5271 0.019264 3121.3425 2.0548 0.025792 2331.3253 1.5347 0.021040 2857.8680 1.8814 cudnn 0.7445
512 8192 8192 0.017216 3991.6053 2.5581 0.021360 3217.2040 2.0618 0.028320 2426.4924 1.5551 0.023744 2894.1828 1.8548 cudnn 0.7251
512 8192 14336 0.027599 4357.2921 2.5645 0.034640 3471.6826 2.0433 0.047712 2520.5207 1.4835 0.039392 3052.8809 1.7968 cudnn 0.7006
512 8192 28672 0.053648 4483.2644 2.4823 0.061664 3900.4633 2.1596 0.087280 2755.7077 1.5258 0.070656 3404.0728 1.8848 cudnn 0.7593
512 10240 8192 0.026240 3273.6031 2.0780 0.028192 3046.9405 1.9341 0.040080 2143.2240 1.3604 0.033936 2531.2533 1.6068 cudnn 0.7732
1024 512 7168 0.009184 818.4008 0.7136 0.009984 752.8238 0.6564 0.011856 633.9569 0.5528 0.010160 739.7828 0.6450 cudnn 0.9039
1024 896 1024 0.004352 431.7666 0.6475 0.005088 369.3098 0.5539 0.005056 371.6472 0.5574 0.004480 419.4304 0.6290 cudnn 0.9714
1024 1024 7168 0.009232 1628.2913 1.0222 0.010464 1436.5812 0.9019 0.011904 1262.8012 0.7928 0.010640 1412.8182 0.8870 cudnn 0.8677
1024 4608 7168 0.015072 4488.1724 1.9654 0.017712 3819.2036 1.6724 0.024064 2811.0761 1.2310 0.019072 3546.8611 1.5532 cudnn 0.7903
1024 7168 256 0.005664 663.5057 2.7769 0.006256 600.7187 2.5142 0.010896 344.9061 1.4435 0.005600 671.0886 2.8087 cute_dsl 1.0114
1024 7168 512 0.006464 1162.7773 2.5955 0.006912 1087.4121 2.4273 0.011904 631.4006 1.4094 0.006080 1236.2159 2.7594 cute_dsl 1.0632
1024 7168 4608 0.017440 3878.7692 1.9240 0.018336 3689.2307 1.8300 0.026640 2539.2543 1.2596 0.019424 3482.5852 1.7275 cudnn 0.8979
1024 9216 7168 0.025904 5222.8023 2.1454 0.031344 4316.3435 1.7731 0.045376 2981.5645 1.2248 0.035296 3833.0539 1.5745 cudnn 0.7339

GB200 Autotune (times in ms; throughput in TFLOPS and TB/s)

m n k cudnn_time cudnn_tflops cudnn_tb_per_sec cutlass_time cutlass_tflops cutlass_tb_per_sec trtllm_time trtllm_tflops trtllm_tb_per_sec cute_dsl_time cute_dsl_tflops cute_dsl_tb_per_sec best_backend cute_dsl_vs_best_other_speedup
1 512 7168 0.009760 0.7521 0.1885 0.009088 0.8077 0.2024 0.010944 0.6707 0.1681 0.007712 0.9518 0.2385 cute_dsl 1.1784
1 896 1024 0.004192 0.4377 0.1100 0.004288 0.4279 0.1075 0.004816 0.3810 0.0957 0.004016 0.4569 0.1148 cute_dsl 1.0438
1 896 5120 0.008192 1.1201 0.2805 0.008000 1.1469 0.2873 0.010256 0.8946 0.2241 0.006656 1.3785 0.3453 cute_dsl 1.2019
1 1024 7168 0.010192 1.4404 0.3606 0.009520 1.5420 0.3861 0.011200 1.3107 0.3282 0.008064 1.8204 0.4558 cute_dsl 1.1806
1 1280 8192 0.011200 1.8725 0.4687 0.010399 2.0166 0.5048 0.015072 1.3914 0.3483 0.008768 2.3918 0.5987 cute_dsl 1.1861
1 1792 5120 0.008576 2.1397 0.5356 0.008352 2.1971 0.5500 0.011040 1.6621 0.4161 0.007088 2.5889 0.6481 cute_dsl 1.1783
1 2560 8192 0.012192 3.4402 0.8608 0.011472 3.6561 0.9148 0.015744 2.6641 0.6666 0.009872 4.2487 1.0631 cute_dsl 1.1621
1 3584 5120 0.009328 3.9346 0.9847 0.009360 3.9210 0.9813 0.011568 3.1726 0.7940 0.008128 4.5153 1.1300 cute_dsl 1.1476
1 4608 7168 0.011760 5.6174 1.4054 0.011392 5.7988 1.4508 0.012880 5.1289 1.2832 0.010048 6.5745 1.6449 cute_dsl 1.1338
1 5120 640 0.004576 1.4322 0.3603 0.004624 1.4173 0.3566 0.004736 1.3838 0.3482 0.004544 1.4423 0.3629 cute_dsl 1.0070
1 5120 1024 0.004832 2.1701 0.5447 0.005216 2.0103 0.5046 0.005472 1.9163 0.4810 0.004640 2.2599 0.5673 cute_dsl 1.0414
1 5120 1280 0.005280 2.4824 0.6227 0.005568 2.3540 0.5905 0.005616 2.3339 0.5854 0.004960 2.6426 0.6628 cute_dsl 1.0645
1 5120 2048 0.006448 3.2524 0.8148 0.006896 3.0411 0.7619 0.006752 3.1060 0.7782 0.005792 3.6208 0.9071 cute_dsl 1.1133
1 5120 2560 0.007072 3.7068 0.9283 0.007392 3.5463 0.8881 0.008224 3.1875 0.7983 0.006368 4.1166 1.0310 cute_dsl 1.1106
1 5120 4096 0.008912 4.7064 1.1780 0.008976 4.6728 1.1696 0.009408 4.4582 1.1159 0.007776 5.3939 1.3501 cute_dsl 1.1461
1 5120 5120 0.009792 5.3542 1.3399 0.009792 5.3542 1.3399 0.010512 4.9875 1.2481 0.008592 6.1020 1.5270 cute_dsl 1.1397
1 5120 8192 0.014080 5.9578 1.4905 0.012640 6.6366 1.6603 0.014320 5.8580 1.4655 0.011200 7.4898 1.8737 cute_dsl 1.1286
1 5120 16384 0.024112 6.9580 1.7403 0.020432 8.2112 2.0537 0.025552 6.5659 1.6422 0.018592 9.0239 2.2570 cute_dsl 1.0990
1 7168 256 0.003935 0.9325 0.2368 0.004272 0.8591 0.2182 0.004096 0.8960 0.2275 0.003920 0.9362 0.2377 cute_dsl 1.0040
1 7168 512 0.004352 1.6866 0.4250 0.004704 1.5604 0.3932 0.004576 1.6040 0.4042 0.004256 1.7246 0.4346 cute_dsl 1.0226
1 7168 4608 0.009568 6.9043 1.7278 0.010336 6.3913 1.5994 0.010880 6.0717 1.5195 0.009984 6.6166 1.6558 cudnn 0.9583
1 7168 5120 0.010352 7.0904 1.7742 0.011072 6.6294 1.6589 0.015392 4.7687 1.1933 0.010496 6.9932 1.7499 cudnn 0.9863
1 8192 1024 0.005248 3.1969 0.8024 0.005856 2.8650 0.7191 0.005888 2.8494 0.7152 0.005280 3.1775 0.7976 cudnn 0.9939
1 8192 2048 0.007072 4.7447 1.1886 0.007712 4.3509 1.0900 0.007392 4.5393 1.1372 0.007024 4.7771 1.1968 cute_dsl 1.0068
1 8192 3584 0.008880 6.6126 1.6552 0.009504 6.1785 1.5465 0.010688 5.4940 1.3752 0.009088 6.4613 1.6173 cudnn 0.9771
1 8192 4096 0.009344 7.1820 1.7975 0.009984 6.7216 1.6823 0.011008 6.0964 1.5258 0.009728 6.8985 1.7265 cudnn 0.9605
1 8192 7168 0.013488 8.7070 2.1782 0.014224 8.2565 2.0655 0.016992 6.9115 1.7291 0.013824 8.4954 2.1253 cudnn 0.9757
1 8192 8192 0.014656 9.1579 2.2909 0.015488 8.6659 2.1678 0.022976 5.8416 1.4613 0.015008 8.9431 2.2371 cudnn 0.9765
1 8192 14336 0.022656 10.3673 2.5929 0.025008 9.3922 2.3490 0.031616 7.4292 1.8580 0.024496 9.5885 2.3981 cudnn 0.9249
1 8192 28672 0.040720 11.5364 2.8849 0.040464 11.6094 2.9031 0.055696 8.4344 2.1092 0.038384 12.2385 3.0604 cute_dsl 1.0542
1 9216 7168 0.013808 9.5684 2.3937 0.014592 9.0543 2.2651 0.019088 6.9217 1.7316 0.014144 9.3411 2.3368 cudnn 0.9762
1 10240 8192 0.016752 10.0151 2.5052 0.016848 9.9580 2.4910 0.021216 7.9078 1.9781 0.016416 10.2200 2.5565 cute_dsl 1.0205
4 512 7168 0.009952 2.9502 0.1862 0.009056 3.2421 0.2047 0.011008 2.6672 0.1684 0.007648 3.8389 0.2423 cute_dsl 1.1841
4 896 1024 0.004256 1.7246 0.1100 0.004352 1.6866 0.1075 0.004816 1.5241 0.0972 0.004032 1.8204 0.1161 cute_dsl 1.0556
4 1024 7168 0.010256 5.7255 0.3600 0.009360 6.2735 0.3945 0.013536 4.3381 0.2728 0.008080 7.2674 0.4570 cute_dsl 1.1584
4 4608 7168 0.011808 22.3781 1.4030 0.011360 23.2607 1.4583 0.014784 17.8735 1.1206 0.009952 26.5516 1.6646 cute_dsl 1.1415
4 7168 256 0.003808 3.8551 0.2561 0.004224 3.4754 0.2309 0.004032 3.6409 0.2419 0.003872 3.7913 0.2519 cudnn 0.9835
4 7168 512 0.004352 6.7464 0.4351 0.004672 6.2843 0.4053 0.004704 6.2415 0.4025 0.004224 6.9508 0.4482 cute_dsl 1.0303
4 7168 2304 0.007072 18.6822 1.1764 0.007584 17.4210 1.0970 0.007536 17.5319 1.1040 0.007119 18.5576 1.1685 cudnn 0.9933
4 7168 4608 0.009696 27.2526 1.7102 0.010448 25.2911 1.5871 0.011840 22.3177 1.4005 0.008752 30.1921 1.8946 cute_dsl 1.1079
4 9216 7168 0.013696 38.5866 2.4181 0.014544 36.3368 2.2771 0.018880 27.9916 1.7541 0.013904 38.0094 2.3819 cudnn 0.9850
8 896 5120 0.008064 9.1022 0.2888 0.007808 9.4007 0.2982 0.009008 8.1483 0.2585 0.006640 11.0543 0.3507 cute_dsl 1.1759
8 1280 8192 0.011232 14.9376 0.4715 0.010432 16.0825 0.5077 0.012512 13.4089 0.4233 0.009024 18.5918 0.5869 cute_dsl 1.1560
8 1792 5120 0.008640 16.9908 0.5367 0.008384 17.5096 0.5530 0.011024 13.3165 0.4206 0.007024 20.8999 0.6601 cute_dsl 1.1936
8 2560 8192 0.012064 27.8137 0.8753 0.011392 29.4544 0.9269 0.013184 25.4509 0.8009 0.009888 33.9345 1.0679 cute_dsl 1.1521
8 3584 5120 0.009344 31.4214 0.9902 0.008576 34.2352 1.0789 0.009888 29.6927 0.9358 0.007856 37.3729 1.1778 cute_dsl 1.0916
8 5120 640 0.004864 10.7789 0.3542 0.004640 11.2993 0.3713 0.004768 10.9960 0.3613 0.004384 11.9591 0.3930 cute_dsl 1.0584
8 5120 1024 0.005088 16.4870 0.5321 0.005104 16.4354 0.5305 0.005120 16.3840 0.5288 0.004608 18.2044 0.5876 cute_dsl 1.1042
8 5120 1280 0.005328 19.6805 0.6314 0.005472 19.1626 0.6147 0.005632 18.6182 0.5973 0.004880 21.4872 0.6893 cute_dsl 1.0918
8 5120 2048 0.006560 25.5750 0.8130 0.006288 26.6813 0.8481 0.007408 22.6474 0.7199 0.005728 29.2898 0.9310 cute_dsl 1.0978
8 5120 2560 0.007040 29.7891 0.9440 0.006880 30.4819 0.9660 0.008096 25.9036 0.8209 0.006336 33.0990 1.0489 cute_dsl 1.0859
8 5120 4096 0.008992 37.3159 1.1771 0.008912 37.6508 1.1876 0.009200 36.4722 1.1504 0.007520 44.6203 1.4075 cute_dsl 1.1851
8 5120 5120 0.009984 42.0103 1.3231 0.009664 43.4013 1.3669 0.010464 40.0832 1.2624 0.008192 51.2000 1.6125 cute_dsl 1.1797
8 5120 8192 0.013904 48.2659 1.5166 0.012896 52.0385 1.6351 0.014272 47.0213 1.4775 0.011216 59.8332 1.8800 cute_dsl 1.1498
8 5120 16384 0.023424 57.2992 1.7969 0.020544 65.3318 2.0488 0.024784 54.1550 1.6983 0.017616 76.1908 2.3893 cute_dsl 1.1662
8 7168 5120 0.010144 57.8867 1.8223 0.010912 53.8126 1.6940 0.012960 45.3088 1.4263 0.008768 66.9711 2.1083 cute_dsl 1.1569
8 8192 1024 0.005216 25.7319 0.8300 0.005728 23.4319 0.7558 0.005856 22.9197 0.7393 0.005056 26.5462 0.8563 cute_dsl 1.0316
8 8192 2048 0.007072 37.9575 1.2059 0.007680 34.9525 1.1104 0.007328 36.6315 1.1637 0.006320 42.4740 1.3493 cute_dsl 1.1190
8 8192 3584 0.008736 53.7731 1.6971 0.009376 50.1026 1.5812 0.009856 47.6625 1.5042 0.007680 61.1669 1.9304 cute_dsl 1.1375
8 8192 4096 0.009536 56.2994 1.7748 0.009984 53.7731 1.6952 0.013456 39.8983 1.2578 0.008304 64.6560 2.0383 cute_dsl 1.1484
8 8192 7168 0.013472 69.7390 2.1912 0.014320 65.6092 2.0614 0.017136 54.8275 1.7227 0.011616 80.8854 2.5414 cute_dsl 1.1598
8 8192 8192 0.014784 72.6286 2.2807 0.015712 68.3390 2.1460 0.017168 62.5432 1.9640 0.012400 86.5921 2.7192 cute_dsl 1.1923
8 8192 14336 0.022432 83.7664 2.6261 0.024608 76.3592 2.3939 0.028304 66.3881 2.0813 0.019008 98.8556 3.0992 cute_dsl 1.1801
8 8192 28672 0.040128 93.6527 2.9328 0.039664 94.7483 2.9671 0.053392 70.3869 2.2042 0.032352 116.1627 3.6377 cute_dsl 1.2260
8 10240 8192 0.016656 80.5822 2.5300 0.016752 80.1228 2.5156 0.020560 65.2826 2.0496 0.014128 95.0012 2.9827 cute_dsl 1.1789
16 512 7168 0.009856 11.9156 0.1937 0.009152 12.8322 0.2086 0.010976 10.6998 0.1739 0.007616 15.4202 0.2506 cute_dsl 1.2017
16 896 1024 0.004176 7.0307 0.1187 0.004288 6.8470 0.1156 0.004928 5.9578 0.1006 0.003936 7.4594 0.1259 cute_dsl 1.0610
16 1024 7168 0.010272 22.8661 0.3661 0.009504 24.7139 0.3956 0.011295 20.7942 0.3329 0.007968 29.4780 0.4719 cute_dsl 1.1928
16 4608 7168 0.011856 89.1502 1.4102 0.011312 93.4375 1.4781 0.012689 83.3010 1.3177 0.009664 109.3713 1.7301 cute_dsl 1.1705
16 7168 256 0.003904 15.0410 0.2943 0.004368 13.4433 0.2630 0.004112 14.2802 0.2794 0.003744 15.6838 0.3069 cute_dsl 1.0427
16 7168 512 0.004288 27.3882 0.4824 0.004672 25.1371 0.4427 0.004688 25.0513 0.4412 0.004176 28.1227 0.4953 cute_dsl 1.0268
16 7168 2304 0.007520 70.2769 1.1310 0.007648 69.1007 1.1121 0.007488 70.5772 1.1359 0.006304 83.8329 1.3492 cute_dsl 1.1878
16 7168 4608 0.009920 106.5489 1.6917 0.010608 99.6384 1.5819 0.011168 94.6422 1.5026 0.008640 122.3339 1.9423 cute_dsl 1.1481
16 9216 7168 0.013632 155.0711 2.4488 0.014480 145.9896 2.3054 0.017503 120.7718 1.9072 0.011872 178.0601 2.8119 cute_dsl 1.1482
64 512 7168 0.009952 47.2028 0.2140 0.008976 52.3353 0.2373 0.011120 42.2448 0.1915 0.007696 61.0398 0.2768 cute_dsl 1.1663
64 896 1024 0.004320 27.1853 0.1403 0.004384 26.7884 0.1383 0.004864 24.1448 0.1246 0.003872 30.3307 0.1566 cute_dsl 1.1157
64 896 5120 0.008208 71.5403 0.3134 0.007888 74.4425 0.3261 0.009184 63.9376 0.2801 0.006688 87.7994 0.3846 cute_dsl 1.1794
64 1280 8192 0.011120 120.6994 0.5098 0.010192 131.6893 0.5562 0.012704 105.6500 0.4462 0.008784 152.7980 0.6454 cute_dsl 1.1603
64 1792 5120 0.008720 134.6795 0.5712 0.008480 138.4912 0.5874 0.011312 103.8194 0.4403 0.007248 162.0316 0.6872 cute_dsl 1.1700
64 2560 8192 0.012049 222.7957 0.9193 0.011488 233.6660 0.9641 0.013184 203.6070 0.8401 0.009888 271.4760 1.1201 cute_dsl 1.1618
64 3584 5120 0.009488 247.5559 1.0326 0.008608 272.8795 1.1383 0.011696 200.8217 0.8377 0.007872 298.3753 1.2446 cute_dsl 1.0934
64 4608 7168 0.011872 356.1202 1.4601 0.011328 373.2220 1.5302 0.014912 283.5205 1.1624 0.009696 436.0415 1.7878 cute_dsl 1.1683
64 5120 640 0.004624 90.7073 0.5005 0.004608 91.0222 0.5022 0.004896 85.6680 0.4727 0.004304 97.4513 0.5377 cute_dsl 1.0706
64 5120 1024 0.004928 136.1787 0.6716 0.004992 134.4328 0.6630 0.005312 126.3345 0.6230 0.004544 147.6868 0.7283 cute_dsl 1.0845
64 5120 1280 0.005296 158.3952 0.7502 0.005344 156.9725 0.7435 0.005840 143.6405 0.6803 0.004880 171.8977 0.8142 cute_dsl 1.0852
64 5120 2048 0.006560 204.6002 0.9091 0.006272 213.9951 0.9509 0.006976 192.3993 0.8549 0.005728 234.3187 1.0412 cute_dsl 1.0950
64 5120 2560 0.007360 227.9513 0.9906 0.006880 243.8549 1.0597 0.008304 202.0378 0.8780 0.006224 269.5568 1.1714 cute_dsl 1.1054
64 5120 4096 0.009024 297.4684 1.2491 0.008272 324.5109 1.3627 0.009376 286.3006 1.2022 0.007584 353.9497 1.4863 cute_dsl 1.0907
64 5120 5120 0.009856 340.4468 1.4130 0.009728 344.9263 1.4316 0.010624 315.8361 1.3108 0.008192 409.6000 1.7000 cute_dsl 1.1875
64 5120 8192 0.013455 398.9974 1.6268 0.012944 414.7643 1.6911 0.014528 369.5422 1.5067 0.010880 493.4475 2.0119 cute_dsl 1.1897
64 5120 16384 0.023072 465.3975 1.8691 0.020640 520.2238 2.0893 0.024864 431.8460 1.7343 0.018400 583.5553 2.3436 cute_dsl 1.1217
64 7168 256 0.003968 59.1938 0.4645 0.004288 54.7764 0.4299 0.004320 54.3706 0.4267 0.003680 63.8264 0.5009 cute_dsl 1.0783
64 7168 512 0.004352 107.9416 0.6362 0.004672 100.5484 0.5927 0.004704 99.8644 0.5886 0.004048 116.0479 0.6840 cute_dsl 1.0751
64 7168 2304 0.007040 300.2740 1.3137 0.007088 298.2406 1.3048 0.007472 282.9134 1.2378 0.006256 337.9043 1.4784 cute_dsl 1.1253
64 7168 4608 0.009920 426.1954 1.7722 0.010575 399.7786 1.6623 0.010688 395.5706 1.6448 0.008352 506.2091 2.1049 cute_dsl 1.1877
64 7168 5120 0.010400 451.6943 1.8684 0.011008 426.7460 1.7652 0.011168 420.6322 1.7399 0.008944 525.2259 2.1726 cute_dsl 1.1628
64 8192 1024 0.005504 195.0839 0.9585 0.005760 186.4135 0.9159 0.005760 186.4135 0.9159 0.004992 215.0925 1.0568 cute_dsl 1.1026
64 8192 2048 0.007200 298.2616 1.3198 0.007744 277.3094 1.2271 0.007424 289.2623 1.2800 0.006336 338.9337 1.4998 cute_dsl 1.1364
64 8192 3584 0.009120 412.0720 1.7372 0.009584 392.1219 1.6531 0.009488 396.0894 1.6698 0.007744 485.2914 2.0459 cute_dsl 1.1777
64 8192 4096 0.009696 442.9628 1.8520 0.010208 420.7452 1.7591 0.011600 370.2558 1.5480 0.008560 501.7485 2.0978 cute_dsl 1.1327
64 8192 7168 0.013520 555.9314 2.2661 0.014096 533.2146 2.1735 0.016880 445.2721 1.8151 0.011440 657.0099 2.6782 cute_dsl 1.1818
64 8192 8192 0.015024 571.7475 2.3206 0.015424 556.9200 2.2604 0.015904 540.1116 2.1922 0.012032 713.9241 2.8977 cute_dsl 1.2487
64 8192 14336 0.022831 658.4055 2.6379 0.025360 592.7597 2.3749 0.025920 579.9531 2.3236 0.019504 770.7335 3.0880 cute_dsl 1.1706
64 8192 28672 0.039552 760.1424 3.0190 0.039599 759.2210 3.0154 0.047808 628.8649 2.4976 0.032192 933.9206 3.7092 cute_dsl 1.2286
64 9216 7168 0.014256 593.1339 2.4158 0.014576 580.1123 2.3627 0.015520 544.8271 2.2190 0.011920 709.3722 2.8892 cute_dsl 1.1960
64 10240 8192 0.016432 653.4456 2.6482 0.016992 631.9102 2.5610 0.018287 587.1452 2.3795 0.013728 782.1546 3.1699 cute_dsl 1.1970
256 512 7168 0.010096 186.1181 0.2986 0.009504 197.7113 0.3172 0.010896 172.4530 0.2767 0.007776 241.6471 0.3877 cute_dsl 1.2222
256 896 1024 0.004368 107.5463 0.2401 0.004432 105.9932 0.2366 0.005216 90.0617 0.2010 0.004000 117.4405 0.2621 cute_dsl 1.0920
256 1024 7168 0.008736 430.1850 0.5851 0.009824 382.5424 0.5203 0.011584 324.4353 0.4413 0.007984 470.7035 0.6403 cute_dsl 1.0942
256 4608 7168 0.010592 1596.6233 1.8686 0.011456 1476.2076 1.7276 0.015552 1087.4121 1.2726 0.010464 1616.1538 1.8914 cute_dsl 1.0122
256 7168 256 0.004160 225.8471 1.1106 0.004512 208.2279 1.0240 0.004560 206.0360 1.0132 0.004128 227.5979 1.1193 cute_dsl 1.0078
256 7168 512 0.004640 404.9673 1.2006 0.004896 383.7925 1.1378 0.005056 371.6472 1.1018 0.004479 419.4772 1.2436 cute_dsl 1.0358
256 7168 2304 0.007040 1201.0961 1.7361 0.007744 1091.9056 1.5783 0.007904 1069.8022 1.5464 0.007248 1166.6276 1.6863 cudnn 0.9713
256 7168 4608 0.009184 1841.4018 2.2621 0.010000 1691.1434 2.0775 0.011184 1512.1096 1.8576 0.009408 1797.5589 2.2082 cudnn 0.9762
256 9216 7168 0.012736 2655.6900 3.0360 0.013728 2463.7870 2.8166 0.016224 2084.7428 2.3833 0.013056 2590.5995 2.9616 cudnn 0.9755
512 896 5120 0.007424 632.7614 0.6091 0.008144 576.8198 0.5553 0.009248 507.9607 0.4890 0.006800 690.8265 0.6650 cute_dsl 1.0918
512 1280 8192 0.011552 929.4857 0.7489 0.011072 969.7813 0.7813 0.012640 849.4793 0.6844 0.009216 1165.0844 0.9387 cute_dsl 1.2014
512 1792 5120 0.007856 1195.9319 0.9844 0.008688 1081.4043 0.8901 0.009680 970.5827 0.7989 0.007552 1244.0732 1.0240 cute_dsl 1.0403
512 2560 8192 0.010784 1991.3609 1.4099 0.011824 1816.2074 1.2859 0.015648 1372.3694 0.9716 0.011264 1906.5018 1.3498 cudnn 0.9574
512 3584 5120 0.008928 2104.6687 1.5855 0.009856 1906.5018 1.4363 0.011872 1582.7562 1.1924 0.009088 2067.6146 1.5576 cudnn 0.9824
512 5120 640 0.005568 602.6299 1.2653 0.006048 554.8021 1.1649 0.008352 401.7533 0.8435 0.005344 627.8898 1.3183 cute_dsl 1.0419
512 5120 1024 0.006016 892.4051 1.3508 0.006496 826.4638 1.2510 0.007296 735.8428 1.1138 0.006512 824.4332 1.2479 cudnn 0.9238
512 5120 1280 0.006592 1018.0350 1.3421 0.006976 961.9963 1.2683 0.009600 699.0507 0.9216 0.006464 1038.1941 1.3687 cute_dsl 1.0198
512 5120 2048 0.007696 1395.1947 1.4306 0.008192 1310.7200 1.3440 0.009888 1085.9039 1.1135 0.007808 1375.1816 1.4101 cudnn 0.9857
512 5120 2560 0.008512 1576.8060 1.4629 0.009504 1412.2236 1.3102 0.011056 1213.9809 1.1263 0.008672 1547.7137 1.4359 cudnn 0.9815
512 5120 4096 0.010512 2042.8878 1.5960 0.011616 1848.7290 1.4443 0.014271 1504.7358 1.1756 0.010944 1962.2475 1.5330 cudnn 0.9605
512 5120 5120 0.011696 2295.1048 1.6810 0.012640 2123.6982 1.5554 0.016576 1619.4224 1.1861 0.012320 2178.8592 1.5958 cudnn 0.9494
512 5120 8192 0.015616 2750.3633 1.8130 0.016400 2618.8825 1.7263 0.025568 1679.8214 1.1073 0.016880 2544.4119 1.6772 cudnn 0.9251
512 5120 16384 0.027232 3154.3532 1.8868 0.026720 3214.7959 1.9229 0.044480 1931.1903 1.1551 0.024576 3495.2533 2.0907 cute_dsl 1.0872
512 7168 5120 0.012528 2999.7577 2.1552 0.013344 2816.3192 2.0234 0.018048 2082.2786 1.4961 0.011936 3148.5392 2.2621 cute_dsl 1.0496
512 8192 1024 0.006640 1293.6648 1.9345 0.007104 1209.1687 1.8081 0.008224 1044.4959 1.5619 0.006304 1362.6165 2.0376 cute_dsl 1.0533
512 8192 2048 0.008496 2022.1127 2.0364 0.009632 1783.6243 1.7963 0.011232 1529.5468 1.5404 0.009120 1883.7576 1.8971 cudnn 0.9316
512 8192 3584 0.010608 2834.1602 2.2611 0.011616 2588.2207 2.0649 0.014976 2007.5301 1.6016 0.011872 2532.4100 2.0204 cudnn 0.8935
512 8192 4096 0.011392 3016.1287 2.3011 0.012416 2767.3758 2.1113 0.016064 2138.9279 1.6319 0.012752 2694.4588 2.0557 cudnn 0.8934
512 8192 7168 0.015632 3846.5674 2.5322 0.017664 3404.0728 2.2409 0.024368 2467.6123 1.6244 0.016704 3599.7092 2.3697 cudnn 0.9358
512 8192 8192 0.017087 4021.6226 2.5773 0.018816 3652.1831 2.3406 0.026976 2547.4302 1.6326 0.020208 3400.6917 2.1794 cudnn 0.8456
512 8192 14336 0.027808 4324.6218 2.5453 0.030432 3951.7312 2.3258 0.046752 2572.2768 1.5139 0.029008 4145.7213 2.4400 cudnn 0.9586
512 8192 28672 0.054192 4438.2597 2.4574 0.050496 4763.1133 2.6372 0.086704 2774.0307 1.5359 0.046480 5174.6037 2.8651 cute_dsl 1.0864
512 10240 8192 0.026336 3261.6702 2.0704 0.027472 3126.7962 1.9848 0.036784 2335.2367 1.4823 0.024880 3452.5461 2.1916 cute_dsl 1.0585
1024 512 7168 0.009184 818.4008 0.7136 0.009968 754.0322 0.6575 0.011904 631.4006 0.5505 0.008144 922.9117 0.8047 cute_dsl 1.1277
1024 896 1024 0.004384 428.6150 0.6428 0.004768 394.0957 0.5910 0.005088 369.3098 0.5539 0.004192 448.2462 0.6722 cute_dsl 1.0458
1024 1024 7168 0.009376 1603.2834 1.0065 0.010416 1443.2014 0.9060 0.011968 1256.0483 0.7885 0.009008 1668.7817 1.0476 cute_dsl 1.0409
1024 4608 7168 0.015392 4394.8632 1.9245 0.016800 4026.5318 1.7632 0.022144 3054.8110 1.3377 0.016192 4177.7257 1.8294 cudnn 0.9506
1024 7168 256 0.005680 661.6367 2.7691 0.006128 613.2664 2.5667 0.010816 347.4571 1.4542 0.005472 686.7866 2.8744 cute_dsl 1.0380
1024 7168 512 0.006368 1180.3067 2.6346 0.006880 1092.4699 2.4385 0.009472 793.5170 1.7712 0.006096 1232.9713 2.7522 cute_dsl 1.0446
1024 7168 4608 0.017616 3840.0167 1.9048 0.018208 3715.1656 1.8428 0.022192 3048.2036 1.5120 0.016288 4153.1026 2.0601 cute_dsl 1.0815
1024 9216 7168 0.026128 5178.0262 2.1270 0.028367 4769.2419 1.9591 0.040896 3308.1834 1.3589 0.026768 5054.2241 2.0762 cudnn 0.9761

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • Added "cute-dsl" as a supported FP4 backend, integrated into FP4 autotuning and runtime selection (PDL-enabled by default).
    • Added an SM100 persistent block-scaled dense GEMM kernel to accelerate FP4 workloads on SM100/SM103.
  • Tests

    • Updated FP4 tests to include "cute-dsl" and added GPU/gating checks for nvfp4, 128x4 layout, and SM100/SM103.

@coderabbitai
Contributor

coderabbitai Bot commented Feb 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds "cute-dsl" as a new mm_fp4 FP4 GEMM backend: benchmarks and tests updated; mm_fp4 dispatch extended with CuTe DSL availability checks, runner factory entry, and a kernel cache; and a new SM100 block‑scaled persistent GEMM kernel was added.

Changes

Cohort / File(s) Summary
Benchmarks & Tests
benchmarks/routines/gemm.py, tests/gemm/test_mm_fp4.py
Added "cute-dsl" to backend CLI parsing, autotune gating, and test parameterization; test flow includes GPU capability and 128x4 SF layout gating for "cute-dsl".
FP4 GEMM Dispatch & Integration
flashinfer/gemm/gemm_base.py
Integrated "cute-dsl" into mm_fp4 API and literals: added availability check _cute_dsl_gemm_fp4_requirement, runner factory _cute_dsl_gemm_fp4_runner, module kernel cache _CUTE_DSL_MM_FP4_KERNEL_CACHE, updated signatures to accept enable_pdl, and wired "cute-dsl" runner entry.
SM100 Block‑Scaled GEMM Kernel
flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py
Added Sm100BlockScaledPersistentDenseGemmKernel: a persistent‑tile, block‑scaled dense GEMM kernel for SM100 with configuration, validation, staged data movement, PDL-aware epilogue, wrapper, and callable launch interfaces.

Sequence Diagram

sequenceDiagram
    participant User as User
    participant MM as mm_fp4
    participant Disp as Dispatcher
    participant Req as Requirement\r\n(_cute_dsl_gemm_fp4_requirement)
    participant Run as Runner\r\n(_cute_dsl_gemm_fp4_runner)
    participant Cache as Kernel\r\nCache
    participant Kernel as CuTeDSL\r\nKernel

    User->>MM: mm_fp4(..., backend="cute-dsl", enable_pdl=...)
    MM->>Disp: select backend runner
    Disp->>Req: validate availability & constraints
    alt invalid
        Req-->>MM: raise/skip
    else valid
        Disp->>Run: create/obtain runner
        Run->>Cache: lookup compiled kernel by config
        alt cached
            Cache-->>Run: return kernel
        else not cached
            Run->>Kernel: compile kernel
            Kernel-->>Cache: store compiled kernel
            Cache-->>Run: return kernel
        end
        Run-->>Disp: runner instance
        Disp->>Kernel: execute kernel with tensors
        Kernel-->>User: results
    end
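
The same flow as a compact Python sketch: compile once per configuration, cache the result, and reuse it on later calls. Names such as _compile_kernel and get_runner are placeholders for illustration, not the identifiers used in gemm_base.py.

from typing import Callable, Dict, Tuple

# Module-level cache: configuration tuple -> compiled kernel callable.
_KERNEL_CACHE: Dict[Tuple, Callable] = {}

def _compile_kernel(config: Tuple) -> Callable:
    # Stand-in for the expensive JIT step (cute.compile in the real code).
    def kernel(*tensors):
        ...  # launch the compiled CuTe DSL GEMM here
    return kernel

def get_runner(config: Tuple) -> Callable:
    # The first call for a given config pays the compile cost; later calls hit the cache.
    if config not in _KERNEL_CACHE:
        _KERNEL_CACHE[config] = _compile_kernel(config)
    return _KERNEL_CACHE[config]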

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested labels

op: gemm

Suggested reviewers

  • nvmbreughe
  • jiahanc
  • Anerudhan
  • yongwww
  • cyx-6
  • jimmyzho

Poem

🐰 I hopped through kernels, quick and bright,
Cute‑DSL brought FP4 into light.
Tiles persist, scale factors align,
Cached and ready — kernels shine.
Rabbity cheers, a carrot byte delight.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 63.64% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: cute dsl mmfp4 for blackwell' clearly describes the main change: adding CuTeDSL support for mm_fp4 on Blackwell GPUs.
Description check ✅ Passed The description includes related issue reference (#2466), provides context on file origins (TensorRT-LLM, CUTLASS), explains the main changes (integration of CuTeDSL backend, wrapper logic, test/benchmark updates), and includes comprehensive performance data tables. However, the PR checklist items are not checked off.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist
Contributor

Summary of Changes

Hello @nv-yunzheq, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces cute_dsl as an additional backend for mm_fp4 operations, aiming to leverage advanced GPU architectural features for improved performance. The integration includes a new kernel specifically designed for SM100 GPUs, with a placeholder for SM103. Benchmarking data indicates that the cute_dsl backend can offer significant speedups over existing cudnn, cutlass, and trtllm implementations, particularly when autotuning is enabled.

Highlights

  • New Backend Integration: The cute_dsl backend has been integrated for mm_fp4 operations, leveraging advanced GPU architectural features.
  • SM100 Kernel Implementation: A new SM100 block-scaled persistent dense GEMM kernel has been implemented to support the cute_dsl backend.
  • Performance Improvements: Performance benchmarks indicate that the cute_dsl backend often outperforms existing cudnn, cutlass, and trtllm implementations, especially with autotuning enabled.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • benchmarks/routines/gemm.py
    • Added cute_dsl as a selectable kernel backend for testing
    • Included cute_dsl in the list of autotune-supported backends
  • flashinfer/gemm/gemm_base.py
    • Updated mm_fp4 function signatures and literal types to include cute_dsl as a valid backend option
    • Introduced a new enable_pdl parameter for the cute_dsl backend
    • Implemented _cute_dsl_gemm_fp4_requirement and _cute_dsl_gemm_fp4_runner functions to handle cute_dsl specific logic, including kernel caching and dynamic compilation
    • Added cute_dsl to the backend_to_runner_factory mapping
  • flashinfer/gemm/kernels/cute_dsl_gemm_utils.py
    • Added a new file containing shared utilities for CuTe DSL dense block-scaled GEMM kernels
    • Included PDL (Programmatic Dependent Launch) helpers (griddepcontrol_wait, griddepcontrol_launch_dependents)
    • Provided a custom make_ptr utility for CuTe DSL's JIT compilation
    • Implemented PipelineTmaUmma and PipelineUmmaAsync classes for managing asynchronous data transfer and accumulation
  • flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py
    • Added a new file implementing the Sm100BlockScaledPersistentDenseGemmKernel
    • This kernel supports batched matrix multiplication with FP4 data types, leveraging Blackwell GPU features like persistent tile scheduling and warp specialization
    • Includes logic for TMA (Tensor Memory Access) producers, UMMA (Universal Matrix Multiply Accumulate) consumers, and shared memory management
  • tests/gemm/test_mm_fp4.py
    • Extended the _test_mm_fp4 function to include cute_dsl in the backend test parameters
    • Added specific pytest.skip conditions for the cute_dsl backend, ensuring it only runs with nvfp4, 128x4 SF layout, and SM100/SM103 GPUs
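
A rough sketch of that gating, assuming hypothetical helper and parameter names (the real skip conditions live in tests/gemm/test_mm_fp4.py):

import pytest
import torch

def _maybe_skip_cute_dsl(backend: str, fp4_type: str, sf_layout: str) -> None:
    # Only gate the new backend; other backends keep their existing conditions.
    if backend != "cute-dsl":
        return
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) not in [(10, 0), (10, 3)]:
        pytest.skip("cute-dsl mm_fp4 requires SM100 or SM103")
    if fp4_type != "nvfp4":
        pytest.skip("cute-dsl mm_fp4 only supports nvfp4")
    if sf_layout != "128x4":
        pytest.skip("cute-dsl mm_fp4 requires the 128x4 scale-factor layout")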
Activity
  • No human activity has been recorded on this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@nv-yunzheq nv-yunzheq changed the title Cute dsl mmfp4 feat: cute dsl mmfp4 Feb 11, 2026
@nv-yunzheq nv-yunzheq changed the title feat: cute dsl mmfp4 feat: cute dsl mmfp4 for blackwell Feb 11, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request integrates cute_dsl as a new backend for mm_fp4, which is a significant and valuable addition. The changes are well-structured, introducing new, high-performance kernels ported from NVIDIA's libraries and integrating them consistently with the existing backend infrastructure. The new code is complex but appears to be of high quality. I've identified one potential issue in the autotuner's tactic generation logic where an alignment check seems to be incorrect, which could lead to suboptimal kernel selection. Overall, this is an excellent contribution that should improve FP4 GEMM performance.

Comment on lines +2912 to +2913
if swap_ab and not m_aligned:
    continue
Contributor


high

The alignment check for the output matrix C when swap_ab is true appears to be incorrect. When swap_ab is true, the kernel computes B.T @ A.T, and the output is effectively a column-major matrix of shape (n, m). The contiguous dimension in memory is along the columns, which corresponds to the problem's n dimension. Therefore, the alignment check should be on n (n_aligned), not m (m_aligned). This incorrect pruning might exclude valid and potentially optimal kernel configurations.

Suggested change
-if swap_ab and not m_aligned:
-    continue
+if swap_ab and not n_aligned:
+    continue

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@flashinfer/gemm/gemm_base.py`:
- Around lines 3068-3173: The cache key used to index _CUTE_DSL_KERNEL_CACHE must include the device identity to avoid reusing device-specific compiled kernels across GPUs. Extend the cache_key tuple (currently sf_vec_size, mma_tiler_mn, cluster_shape_mn, swap_ab, use_prefetch, kernel_type, use_tma_store, enable_pdl, out_dtype) with the executing device, derived from kernel_a.device (include device.type and device.index, or a stable sentinel such as -1 when index is None), and use that augmented key for both the read and the write of _CUTE_DSL_KERNEL_CACHE so the compiled_gemm and max_active_clusters lookups become device-aware.
- Around lines 3176-3192: The kernel assumes row-major memory when swap_ab=True, but launch_out is set to the non-contiguous view out.T. Change the launch path so the kernel receives a contiguous buffer with the expected layout: when swap_ab is True, allocate a temporary contiguous row-major tensor (or use out.clone().contiguous()) for the kernel to write into and pass it as launch_out, then after the kernel completes copy the result back into the original out via the appropriate transpose (e.g. out.copy_(temp.T)) and free the temporary. Alternatively, allocate out up front with the layout expected by cute.make_ordered_layout so no transposed view is needed. Apply this around the launch_out assignment and the kernel invocation that uses swap_ab. A sketch of both fixes appears after these fix notes, just before the nitpick comments.

In `@flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py`:
- Around line 1655-1658: The docstring documents a non-existent parameter "sepi"
alongside tTR_rC, tidx, and sC; remove the stale "sepi (cute.Tensor):" entry
from the function's docstring (the block describing tTR_rC, tidx, sC) so the
parameter list matches the actual function signature and leave only real
parameters (e.g., tTR_rC, tidx, sC).
- Around line 2017-2021: Rename the helper function
check_contigous_16B_alignment to check_contiguous_16B_alignment and update all
call sites that invoke it (the three places currently calling
check_contigous_16B_alignment) to use the new name; ensure the function
signature (dtype, is_mode0_major, tensor_shape) remains unchanged so callers
still pass the same arguments and behavior is preserved.
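
A combined sketch of the two gemm_base.py fixes above. Helper names are hypothetical, and the snippet quoted further down shows the current cache-key construction; this is an illustration of the suggested changes, not the committed patch.

import torch

def _device_aware_cache_key(kernel_a: torch.Tensor, base_key: tuple) -> tuple:
    # Prepend the executing device so compiled kernels are never reused across GPUs.
    dev = kernel_a.device
    return (dev.type, dev.index if dev.index is not None else -1) + base_key

def _launch_output(out: torch.Tensor, swap_ab: bool):
    # With swap_ab the kernel expects a contiguous row-major (n, m) buffer,
    # so hand it a temporary instead of the non-contiguous view out.T.
    if not swap_ab:
        return out, None
    tmp = torch.empty(out.shape[::-1], dtype=out.dtype, device=out.device)
    return tmp, out

def _copy_back(launch_out: torch.Tensor, original_out) -> None:
    # After the kernel has written launch_out, transpose-copy back if a temporary was used.
    if original_out is not None:
        original_out.copy_(launch_out.T)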
🧹 Nitpick comments (4)
flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py (4)

648-648: Nit: prefix unused unpacked variables with _.

bidy and bidz are never referenced. Prefixing them with _ silences linter warnings and signals intent.

Proposed fix
-        bidx, bidy, bidz = cute.arch.block_idx()
+        bidx, _bidy, _bidz = cute.arch.block_idx()

1677-1719: Unused parameter tidx.

tidx is accepted but never referenced inside epilog_gmem_copy_and_partition. If it's kept for API symmetry with the other epilog_*_copy_and_partition methods, consider documenting that intent. Otherwise, remove it.


1721-1829: Unused parameters a_major_mode and b_major_mode in _compute_stages.

These are passed through from the caller but never referenced in the method body. They appear to be placeholders—consider removing them or adding a comment if they're reserved for future heuristics.

Also note: num_ab_stage (line 1816) has no lower-bound clamp. If shared memory is under-provisioned for a given configuration, this could yield ≤ 1 stages, breaking double-buffered pipeline semantics. The upstream can_implement validation likely guards against this in practice, but a defensive max(num_ab_stage, 2) would be safer.


1923-1947: Unused parameters c_dtype and c_major in is_valid_layouts.

These parameters are accepted but never checked. If all C layouts are valid, remove them from the signature (and update callers). If C-layout validation is planned, consider adding a # TODO to track it.

Comment on lines +3068 to +3173
cache_key = (
    sf_vec_size,
    mma_tiler_mn,
    cluster_shape_mn,
    swap_ab,
    use_prefetch,
    kernel_type,
    use_tma_store,
    enable_pdl,
    out_dtype,
)

if cache_key not in _CUTE_DSL_KERNEL_CACHE:
    # Create kernel instance
    if kernel_type == "sm103" and Sm103Kernel is not None:
        gemm = Sm103Kernel(  # type: ignore[assignment]
            sf_vec_size,
            mma_tiler_mn,
            cluster_shape_mn,
            use_tma_store,
            enable_pdl,
        )
    else:
        gemm = Sm100BlockScaledPersistentDenseGemmKernel(  # type: ignore[assignment]
            sf_vec_size,
            mma_tiler_mn,
            cluster_shape_mn,
            use_prefetch,
            enable_pdl,
        )

    # Create CuTe pointers for compilation
    a_ptr = make_ptr(
        cutlass.Float4E2M1FN,
        kernel_a.data_ptr(),
        cute.AddressSpace.gmem,
        32,
    )
    b_ptr = make_ptr(
        cutlass.Float4E2M1FN,
        kernel_b.data_ptr(),
        cute.AddressSpace.gmem,
        32,
    )
    a_sf_ptr = make_ptr(
        cutlass.Float8E4M3FN,
        kernel_a_sf.data_ptr(),
        cute.AddressSpace.gmem,
        16,
    )
    b_sf_ptr = make_ptr(
        cutlass.Float8E4M3FN,
        kernel_b_sf.data_ptr(),
        cute.AddressSpace.gmem,
        16,
    )
    c_ptr = make_ptr(
        c_cutlass_dtype, out.data_ptr(), cute.AddressSpace.gmem, 16
    )

    # Alpha: ensure 1-dim shape [1] for consistent TVM FFI compilation
    if alpha_tensor is not None:
        alpha_compile = (
            alpha_tensor.reshape(1)
            if alpha_tensor.dim() != 1
            else alpha_tensor
        )
        alpha_cute = cute.runtime.from_dlpack(alpha_compile)
    else:
        alpha_cute = cute.runtime.from_dlpack(
            torch.tensor([1.0], dtype=torch.float32, device=a.device)
        )

    # Get max active clusters
    from flashinfer.cute_dsl.utils import get_max_active_clusters

    max_active_clusters = get_max_active_clusters(
        cluster_shape_mn[0] * cluster_shape_mn[1]
    )

    # Create a fake stream for compilation
    stream = cute.runtime.make_fake_stream(use_tvm_ffi_env_stream=True)

    compiled_gemm = cute.compile(
        gemm.wrapper,
        kernel_m,
        kernel_n,
        real_k,
        sf_m,
        sf_n,
        sf_k,
        batch_size,
        a_ptr,
        b_ptr,
        a_sf_ptr,
        b_sf_ptr,
        c_ptr,
        alpha_cute,
        max_active_clusters,
        stream,
        swap_ab,
        options="--opt-level 2 --enable-tvm-ffi",
    )

    _CUTE_DSL_KERNEL_CACHE[cache_key] = (compiled_gemm, max_active_clusters)

Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's understand the file structure and locate the relevant code
head -20 flashinfer/gemm/gemm_base.py
wc -l flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 795


🏁 Script executed:

# Look at the specific section mentioned in the review
sed -n '3060,3180p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 4434


🏁 Script executed:

# Find where _CUTE_DSL_KERNEL_CACHE is defined and used
rg "_CUTE_DSL_KERNEL_CACHE" -n flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 343


🏁 Script executed:

# Check if sm_version is used or available in this context
sed -n '3000,3200p' flashinfer/gemm/gemm_base.py | rg -n "sm_version|device|SM"

Repository: flashinfer-ai/flashinfer

Length of output: 305


🏁 Script executed:

# Look for the broader function context to understand the function signature
sed -n '3000,3100p' flashinfer/gemm/gemm_base.py | head -100

Repository: flashinfer-ai/flashinfer

Length of output: 3747


🏁 Script executed:

# Check the get_max_active_clusters function
rg "def get_max_active_clusters" -A 10 flashinfer/

Repository: flashinfer-ai/flashinfer

Length of output: 737


🏁 Script executed:

# Look at kernel_type selection logic and if it's based on SM version
sed -n '3000,3070p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 2689


🏁 Script executed:

# Check if enable_pdl and out_dtype are derived from device/SM context
sed -n '2980,3030p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 2247


🏁 Script executed:

# Search for any SM/device version detection or multi-device handling
rg "device\." flashinfer/gemm/gemm_base.py | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Look for where kernel_type is determined (before the forward call)
sed -n '2900,3000p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 4439


🏁 Script executed:

# Find where sm_version is defined in this method/class context
sed -n '2850,2920p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 2423


🏁 Script executed:

# Check the broader class context for where enable_pdl and sm_version come from
sed -n '2780,2850p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 2532


🏁 Script executed:

# Look at get_hardware_info to see what it returns
rg "def get_hardware_info" -A 15 flashinfer/

Repository: flashinfer-ai/flashinfer

Length of output: 1082


🏁 Script executed:

# Check if max_active_clusters is SM-version-dependent
rg "get_max_active_clusters" -B 5 -A 5 flashinfer/cute_dsl/

Repository: flashinfer-ai/flashinfer

Length of output: 1263


🏁 Script executed:

# Verify if the cache is truly module-level and shared across function calls
sed -n '2775,2850p' flashinfer/gemm/gemm_base.py | head -80

Repository: flashinfer-ai/flashinfer

Length of output: 2695


🏁 Script executed:

# Check if there's any device context management or stream handling that might isolate cache usage
sed -n '3100,3180p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 2978


🏁 Script executed:

# Look for multi-GPU test cases or documentation
find flashinfer -name "*.py" -type f | xargs grep -l "multi.*gpu\|device.*index\|cuda.device" | head -5

Repository: flashinfer-ai/flashinfer

Length of output: 213


🏁 Script executed:

# Check if kernel_type "sm100"/"sm103" is truly the only SM-specific differentiator
sed -n '2810,2830p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 827


🏁 Script executed:

# Check how the runner is instantiated and if it's per-device or shared
sed -n '2778,2790p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 444


🏁 Script executed:

# Look for where _cute_dsl_gemm_fp4_runner is called to understand scope
rg "_cute_dsl_gemm_fp4_runner" -B 3 -A 3 flashinfer/

Repository: flashinfer-ai/flashinfer

Length of output: 835


🏁 Script executed:

# Verify the exact scope where the class is created and cached
rg "CuteDSLFp4GemmRunner" -B 5 -A 5 flashinfer/

Repository: flashinfer-ai/flashinfer

Length of output: 1324


🏁 Script executed:

# Check if there's any per-device runner instance management
rg "get_device_properties\|device.*index\|device.*type" flashinfer/gemm/gemm_base.py | head -10

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Check how the runner is actually used and if devices could differ
rg "CuteDSLFp4GemmRunner\|cute_dsl.*runner" -B 5 -A 5 flashinfer/gemm/ | head -50

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Look for any device context management in forward method
sed -n '3040,3080p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 1457


🏁 Script executed:

# Check if there's ever a case where the function is called with different devices but same SM version
sed -n '3200,3250p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 1948


🏁 Script executed:

# Verify what happens when cache is reused - does it handle different devices correctly?
sed -n '3170,3220p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 1696


🏁 Script executed:

# Check how TVM FFI stream and compilation work together
rg "make_fake_stream\|use_tvm_ffi_env_stream" -B 3 -A 3 flashinfer/

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Check if get_hardware_info is truly cached globally and not per-device
sed -n '1,50p' flashinfer/cute_dsl/utils.py

Repository: flashinfer-ai/flashinfer

Length of output: 1545


🏁 Script executed:

# Examine the runner factory pattern and whether runners are per-device or shared
sed -n '3250,3300p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 1500


🏁 Script executed:

# Check if there's any mechanism to invalidate/update the cache per device
rg "KERNEL_CACHE\|clear\|device" flashinfer/gemm/gemm_base.py | grep -i cache

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Check if there's any device context isolation in the execution
sed -n '3195,3230p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 1051


🏁 Script executed:

# Look at how runners are created and cached at the module level
sed -n '3260,3280p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 586


🏁 Script executed:

# Check if cute.compile produces portable or device-specific binaries
rg "cute.compile\|TVM.*compile" -B 2 -A 2 flashinfer/ | head -30

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Verify if the cache could be shared across different device instantiations
sed -n '3245,3260p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 767


🏁 Script executed:

# Check where the runner is called and understand the call context better
sed -n '3175,3195p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 898


🏁 Script executed:

# Look at the full forward method signature to confirm sm_version availability
sed -n '3030,3050p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 621


🏁 Script executed:

# Check what variables are in scope within the forward method
sed -n '3020,3080p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 2012


🏁 Script executed:

# Confirm that we have access to device info and SM version in the forward method
sed -n '2810,2820p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 465


🏁 Script executed:

# Verify that a.device is accessible in forward and has the needed attributes
sed -n '3035,3045p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 384


🏁 Script executed:

# Check if sm_version is in the closure scope of the forward method
sed -n '2810,2825p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 626


🏁 Script executed:

# Confirm the exact closure and class structure
sed -n '2845,2860p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 623


Cache key must include device identity to prevent cross-device reuse in multi-GPU setups.

_CUTE_DSL_KERNEL_CACHE lacks device-specific information in its cache key. The current 9-tuple key includes kernel parameters but omits device identity; compiled GEMM binaries and max_active_clusters are device-specific. In multi-GPU or heterogeneous SM configurations, this allows incorrect reuse of kernels compiled for different devices. Include device type and index in the cache key:

🔧 Suggested fix (device-aware cache key)
             # Cache key for compiled kernel
+            device_key = (a.device.type, a.device.index)
             cache_key = (
+                device_key,
                 sf_vec_size,
                 mma_tiler_mn,
                 cluster_shape_mn,
                 swap_ab,
                 use_prefetch,
                 kernel_type,
                 use_tma_store,
                 enable_pdl,
                 out_dtype,
             )
🤖 Prompt for AI Agents
In `@flashinfer/gemm/gemm_base.py` around lines 3068 - 3173, The cache key used to
index _CUTE_DSL_KERNEL_CACHE must include the device identity to avoid reusing
device-specific compiled kernels across GPUs; modify the construction of
cache_key (the tuple currently containing sf_vec_size, mma_tiler_mn,
cluster_shape_mn, swap_ab, use_prefetch, kernel_type, use_tma_store, enable_pdl,
out_dtype) to also incorporate the executing device (derive from kernel_a.device
— include device.type and device.index (or a stable sentinel like -1 if index is
None)), and use that augmented cache_key when reading/writing
_CUTE_DSL_KERNEL_CACHE for compiled_gemm and max_active_clusters so the
lookup/store around compiled_gemm and max_active_clusters becomes device-aware.
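
For illustration only, a hedged sketch of a device-aware key helper that also covers a None device index with a -1 sentinel, as the prompt suggests (the helper name is an assumption, not code from this PR):

import torch

def device_cache_key(t: torch.Tensor) -> tuple[str, int]:
    # Stable per-device identity: device type plus index, with -1 standing in
    # for a None index (e.g. a bare "cuda" device string).
    idx = t.device.index if t.device.index is not None else -1
    return (t.device.type, idx)

# Usage sketch: prepend the device identity to the existing kernel cache key, e.g.
# cache_key = device_cache_key(kernel_a) + (sf_vec_size, mma_tiler_mn, ...)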

Comment on lines +3176 to +3192
# Handle output tensor for swap_ab
if swap_ab:
    launch_out = out.T
else:
    launch_out = out

# Prepare alpha: ensure it is always a 1-dim tensor with shape [1].
# mm_fp4 may pass alpha as a 0-dim scalar tensor or a 1-dim tensor;
# the TVM FFI compiled kernel requires a consistent 1-dim shape.
if alpha_tensor is None:
    alpha_for_launch = torch.tensor(
        [1.0], dtype=torch.float32, device=a.device
    )
elif alpha_tensor.dim() == 0:
    alpha_for_launch = alpha_tensor.unsqueeze(0)
else:
    alpha_for_launch = alpha_tensor.reshape(1)
Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the context around lines 3176-3192 in the file
head -3200 flashinfer/gemm/gemm_base.py | tail -100

Repository: flashinfer-ai/flashinfer

Length of output: 3737


🏁 Script executed:

# Get full view of the file structure
wc -l flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 101


🏁 Script executed:

# Search for where launch_out is used/passed to kernel
rg "launch_out" flashinfer/gemm/gemm_base.py -A 5 -B 5

Repository: flashinfer-ai/flashinfer

Length of output: 875


🏁 Script executed:

# Search for c_major parameter and layout handling
rg "c_major" flashinfer/gemm/gemm_base.py -B 3 -A 3

Repository: flashinfer-ai/flashinfer

Length of output: 1375


🏁 Script executed:

# Look for swap_ab logic and how output is handled
rg "swap_ab" flashinfer/gemm/gemm_base.py -B 5 -A 10

Repository: flashinfer-ai/flashinfer

Length of output: 10194


🏁 Script executed:

# Search for contiguity checks or .contiguous() calls in the file
rg "contiguous|is_contiguous" flashinfer/gemm/gemm_base.py -B 2 -A 2

Repository: flashinfer-ai/flashinfer

Length of output: 2144


🏁 Script executed:

# Look at the kernel compilation code more carefully
sed -n '3100,3200p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 3771


🏁 Script executed:

# Search for cute kernel implementation or documentation about swap_ab
fd "cute_dsl" flashinfer/ --type f | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 114


🏁 Script executed:

# Check the cute.compile signature and kernel handling
rg "def compile" flashinfer/ -B 2 -A 5 --type py | head -50

Repository: flashinfer-ai/flashinfer

Length of output: 1978


🏁 Script executed:

# Look for how output tensor is allocated/prepared before the gemm call
sed -n '3000,3100p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 3781


🏁 Script executed:

# Find the complete compiled_gemm call
sed -n '3190,3210p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 699


🏁 Script executed:

# Search for any contiguity handling for launch_out specifically
rg "launch_out" flashinfer/gemm/gemm_base.py -B 3 -A 15

Repository: flashinfer-ai/flashinfer

Length of output: 1370


🏁 Script executed:

# Check if there's a contiguous() call or conversion for transposed tensors in swap_ab handling
rg "swap_ab.*contiguous|contiguous.*swap_ab" flashinfer/gemm/gemm_base.py -B 5 -A 5

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Look at how kernel_a and kernel_b handle transposition (they use .T too)
sed -n '3050,3080p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 1165


🏁 Script executed:

# Search for comments or documentation about swap_ab output handling
rg "swap_ab" flashinfer/gemm/gemm_base.py | grep -i "output\|layout\|contiguous"

Repository: flashinfer-ai/flashinfer

Length of output: 114


🏁 Script executed:

# Check if there's kernel implementation that shows how output is handled
find flashinfer/gemm/kernels -name "*.py" -o -name "*.cu" | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 319


🏁 Script executed:

# Look at the cute DSL utils to understand make_ptr and how pointers are handled
cat flashinfer/gemm/kernels/cute_dsl_gemm_utils.py

Repository: flashinfer-ai/flashinfer

Length of output: 22577


🏁 Script executed:

# Search for any documentation or tests related to swap_ab output handling
rg "swap_ab.*out|output.*swap" flashinfer/ --type py -B 2 -A 2 | head -50

Repository: flashinfer-ai/flashinfer

Length of output: 847


🏁 Script executed:

# Look for test files related to swap_ab
fd "test.*gemm\|gemm.*test" flashinfer/ --type f | head -10

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Search for any validation or comments about swap_ab and output layout
rg "swap_ab.*output\|output.*layout.*swap" flashinfer/ -B 2 -A 2 --type py

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Check if kernel actually needs contiguous output by examining kernel internals
sed -n '1,100p' flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py | head -50

Repository: flashinfer-ai/flashinfer

Length of output: 2353


🏁 Script executed:

# Look for any contiguity enforcement in the gemm call itself
rg "contiguous\|data_ptr" flashinfer/gemm/gemm_base.py | grep -A 2 -B 2 "launch_out\|swap_ab"

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Let me check if there's any handling of tensor contiguity for the output tensor
sed -n '3000,3250p' flashinfer/gemm/gemm_base.py | grep -A 5 -B 5 "out = \|\.contiguous\|\.T"

Repository: flashinfer-ai/flashinfer

Length of output: 2141


🏁 Script executed:

# Look for how output tensor is created/allocated
rg "out = torch\|torch.empty.*out" flashinfer/gemm/gemm_base.py -B 2 -A 2 | head -60

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Check kernel implementation to understand if it handles non-contiguous output
sed -n '100,500p' flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py | grep -i "layout\|stride\|contiguous\|c_major" -A 2 -B 2

Repository: flashinfer-ai/flashinfer

Length of output: 6676


🏁 Script executed:

# Search for documentation or comments about output layout handling
rg "output.*layout\|layout.*output" flashinfer/gemm/kernels/ -B 2 -A 2

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Look at the kernel call more carefully - what are the actual parameters being passed?
sed -n '3190,3215p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 764


🏁 Script executed:

# Check if kernel_a and kernel_b which also use .T have any special handling
rg "kernel_a.*data_ptr\|kernel_b.*data_ptr" flashinfer/gemm/gemm_base.py -B 3 -A 3

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Look for any stride information or layout handling in the kernel call
rg "stride\|layout" flashinfer/gemm/gemm_base.py | grep -i "launch_out\|output\|kernel_[ab]" -B 2 -A 2

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Check if TVM FFI kernel expects strides or just pointers
rg "tvm\|ffi" flashinfer/gemm/gemm_base.py | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Check actual kernel wrapper to see what parameters it receives
rg "def wrapper\|kernel_m, kernel_n" flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py -A 10 | head -50

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Search for tests of mm_fp4 with swap_ab to see expected behavior
fd "test" flashinfer/ --type f | xargs rg "mm_fp4\|swap_ab" 2>/dev/null | head -30

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Check documentation or comments about expected output layout with swap_ab
rg "swap_ab.*True\|swap_ab.*output" flashinfer/ -B 3 -A 3 --type py | head -50

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Look at how out tensor is created before being passed to the kernel
sed -n '3020,3060p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 1405


🏁 Script executed:

# Check if there's any usage of out after the gemm call that would reveal issues
sed -n '3200,3210p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 376


🏁 Script executed:

# Let me check if there's any issue tracking or documentation about this
rg "out\.T\|launch_out" flashinfer/ -B 2 -A 2 --type py | grep -v "\.pyc" | head -40

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Check if the kernel compilation with c_major parameter means it handles layout differences
sed -n '3150,3170p' flashinfer/gemm/gemm_base.py

Repository: flashinfer-ai/flashinfer

Length of output: 697


🏁 Script executed:

# Look for any comments or evidence that transposed tensors work correctly
rg "transpos|non.contiguous" flashinfer/gemm/ -B 2 -A 2 --type py

Repository: flashinfer-ai/flashinfer

Length of output: 4905


🏁 Script executed:

# Check if kernel_a and kernel_b being transposed .T causes issues (they're passed same way as launch_out)
rg "kernel_a\|kernel_b" flashinfer/gemm/gemm_base.py | grep -c "\.T"

Repository: flashinfer-ai/flashinfer

Length of output: 69


🏁 Script executed:

# Check the wrapper function signature in the kernel to understand what parameters it receives
sed -n '500,700p' flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py | grep -A 30 "def wrapper"

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Look for c_major or layout handling in the wrapper
rg "c_major\|layout.*swap\|swap.*layout" flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py -B 3 -A 3

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Check if the kernel actually uses stride information or relies on layout from compilation
rg "stride\|shape\|layout" flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py | head -30

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Search for any bug reports or issues related to swap_ab and output
rg "swap_ab.*bug\|issue.*swap_ab\|error.*swap_ab" flashinfer/ --type py

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Look for any assertions or validations about output layout
rg "assert\|ValueError.*layout\|check.*contiguous" flashinfer/gemm/gemm_base.py | grep -i "out\|layout\|swap"

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Check if kernel actually writes to wrong memory locations by looking at how c_major affects kernel behavior
sed -n '1,150p' flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py | grep -i "c_major\|layout\|output"

Repository: flashinfer-ai/flashinfer

Length of output: 128


🏁 Script executed:

# Look for the actual kernel wrapper implementation details
wc -l flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py

Repository: flashinfer-ai/flashinfer

Length of output: 128


🏁 Script executed:

# Find the wrapper function more carefully
rg "wrapper.*=" flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py -B 5 -A 20 | head -80

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Look for how the kernel handles output writes
rg "c_ref\|output.*write\|global.*store" flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py -B 3 -A 3 | head -50

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Search for the actual DSL wrapper definition
sed -n '1800,2192p' flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py | head -100

Repository: flashinfer-ai/flashinfer

Length of output: 3952


🏁 Script executed:

# Search for any actual kernel implementation that writes to output
sed -n '1500,1800p' flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py | grep -i "epilogue\|store\|c_ref" -B 2 -A 5

Repository: flashinfer-ai/flashinfer

Length of output: 3186


🏁 Script executed:

# Look for the actual wrapper function definition
sed -n '2050,2192p' flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py

Repository: flashinfer-ai/flashinfer

Length of output: 5773


🏁 Script executed:

# Check if there's any real-world validation that this works correctly
rg "test.*fp4\|fp4.*test" flashinfer/ -l --type py

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Look at how c_tensor is constructed with the layout - this is the key
sed -n '2120,2160p' flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py

Repository: flashinfer-ai/flashinfer

Length of output: 1759


🏁 Script executed:

# Check if cute.make_tensor respects actual strides or assumes them based on order
rg "make_tensor.*layout\|make_ordered_layout" flashinfer/ --type py -B 2 -A 2 | head -40

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Look for any documentation or warnings about non-contiguous tensors
rg "contiguous\|stride\|layout" flashinfer/gemm/gemm_base.py | grep -B 2 -A 2 "output\|out.*=" | head -30

Repository: flashinfer-ai/flashinfer

Length of output: 50


Confirm kernel layout handling for transposed output. The kernel is JIT-compiled with swap_ab as a parameter and creates a tensor layout via cute.make_ordered_layout based on this flag. When swap_ab=True, it creates a row-major tensor layout (order=(0,1,2)) assuming strides match that layout. However, out.T is a non-contiguous view with different strides than row-major. Since the kernel receives only the base pointer (no stride metadata), it will interpret the memory using incorrect stride assumptions, causing incorrect writes. Either materialize the transposed output as a contiguous buffer and copy results back, or ensure the output tensor is allocated with the correct layout from the start.

🤖 Prompt for AI Agents
In `@flashinfer/gemm/gemm_base.py` around lines 3176 - 3192, The kernel assumes
row-major memory when swap_ab=True but launch_out is set to the non-contiguous
view out.T; change the launch path so the kernel receives a contiguous buffer
with the expected layout: when swap_ab is True, allocate a temporary contiguous
tensor with the row-major layout (or call out.clone().contiguous()) into which
the kernel will write (this is the launch_out passed to the kernel), then after
the kernel completes copy the results back into the original out via the
appropriate transpose (e.g., out.copy_(temp.T)) and free the temp;
alternatively, ensure out is originally allocated with the layout expected by
cute.make_ordered_layout so no transpose view is used. Ensure this change is
applied around the launch_out assignment and kernel invocation that uses swap_ab
and interacts with cute.make_ordered_layout.
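
A hedged sketch of the materialize-and-copy-back option described in the prompt; the helper name is hypothetical, a 2-D output is assumed, and whether the compiled kernel truly requires a contiguous buffer should be confirmed against the cute.make_ordered_layout call it is compiled with:

import torch

def launch_output_for_swap_ab(out: torch.Tensor, swap_ab: bool) -> tuple[torch.Tensor, bool]:
    # When swap_ab is set, hand the kernel a contiguous (n, m) scratch buffer
    # instead of the non-contiguous view out.T.
    if swap_ab:
        scratch = torch.empty(
            (out.shape[1], out.shape[0]), dtype=out.dtype, device=out.device
        )
        return scratch, True
    return out, False

# After launching the kernel:
#   launch_out, copy_back = launch_output_for_swap_ab(out, swap_ab)
#   ... kernel writes into launch_out ...
#   if copy_back:
#       out.copy_(launch_out.T)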

Comment thread flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py
Comment on lines +2017 to +2021
def check_contigous_16B_alignment(dtype, is_mode0_major, tensor_shape):
    major_mode_idx = 0 if is_mode0_major else 1
    num_major_elements = tensor_shape[major_mode_idx]
    num_contiguous_elements = 16 * 8 // dtype.width
    return num_major_elements % num_contiguous_elements == 0
Contributor

⚠️ Potential issue | 🟡 Minor

Typo: check_contigous_16B_alignment → check_contiguous_16B_alignment.

Minor typo in the inner helper name ("contigous" → "contiguous").

Proposed fix
-        def check_contigous_16B_alignment(dtype, is_mode0_major, tensor_shape):
+        def check_contiguous_16B_alignment(dtype, is_mode0_major, tensor_shape):

Update the three call sites on lines 2024–2026 accordingly.

🤖 Prompt for AI Agents
In `@flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py` around lines 2017 -
2021, Rename the helper function check_contigous_16B_alignment to
check_contiguous_16B_alignment and update all call sites that invoke it (the
three places currently calling check_contigous_16B_alignment) to use the new
name; ensure the function signature (dtype, is_mode0_major, tensor_shape)
remains unchanged so callers still pass the same arguments and behavior is
preserved.

bkryu previously requested changes Feb 11, 2026
Collaborator

@bkryu bkryu left a comment


Thanks @nv-yunzheq, left a number of comments

Comment thread flashinfer/gemm/gemm_base.py Outdated
Comment thread benchmarks/routines/gemm.py Outdated
Comment thread flashinfer/gemm/kernels/cute_dsl_gemm_utils.py Outdated
Comment thread flashinfer/gemm/kernels/cute_dsl_gemm_utils.py Outdated
Comment thread flashinfer/gemm/kernels/cute_dsl_gemm_utils.py Outdated
Comment thread flashinfer/gemm/gemm_base.py
Comment thread flashinfer/gemm/gemm_base.py Outdated
Comment thread flashinfer/gemm/kernels/cute_dsl_gemm_utils.py Outdated
@nv-yunzheq
Collaborator Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !311 has been created, and the CI pipeline #43830201 is currently running. I'll report back once the pipeline job completes.

Collaborator

@bkryu bkryu left a comment


Thanks for updating. No concerns on my end but will wait for a few more pairs of eyes before approving

@bkryu bkryu dismissed their stale review February 12, 2026 00:34

Dismissing "request for change" as requested changes have been made

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@flashinfer/gemm/gemm_base.py`:
- Around line 3392-3397: Docstring inconsistency: replace the underscore form
`cute_dsl` with the exact backend literal `"cute-dsl"` in backticks wherever it
appears in the docstring for the enable_pdl parameter (and the other occurrence
noted around line 3402) so the documentation matches the actual backend name;
update the text referencing enable_pdl to read `\"cute-dsl\"` (in backticks) to
ensure consistent naming across the docstring for the enable_pdl parameter and
related descriptive lines.

In `@flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py`:
- Around line 1807-1824: The computed stage counts can become non-positive;
after calculating num_ab_stage and refining num_c_stage (using smem_capacity,
occupancy, mbar_helpers_bytes, c_bytes, ab_bytes_per_stage, c_bytes_per_stage),
clamp them to safe minima (e.g., num_ab_stage = max(1, num_ab_stage) and
num_c_stage = max(2, num_c_stage)) or raise a clear exception if the tile
configuration is invalid; do this just before the return in the function that
computes stages so the pipeline never receives <=0 stages and include a short
error message if you choose to raise.
🧹 Nitpick comments (6)
flashinfer/gemm/gemm_base.py (3)

3194-3197: Avoid allocating a new tensor on every forward call when alpha is None.

torch.tensor([1.0], ...) allocates a new CUDA tensor on every invocation. For a hot GEMM path, consider caching the default alpha once (e.g., as an instance attribute or a module-level constant per device).

♻️ Suggested approach
+            # Cache a default alpha=1.0 tensor to avoid per-call allocation
+            _default_alpha_cache = {}
+
             # Prepare alpha: ensure it is always a 1-dim tensor with shape [1].
             if alpha_tensor is None:
-                alpha_for_launch = torch.tensor(
-                    [1.0], dtype=torch.float32, device=a.device
-                )
+                device = a.device
+                if device not in _default_alpha_cache:
+                    _default_alpha_cache[device] = torch.tensor(
+                        [1.0], dtype=torch.float32, device=device
+                    )
+                alpha_for_launch = _default_alpha_cache[device]

You could place _default_alpha_cache as a class attribute on CuteDSLFp4GemmRunner or a closure variable in _cute_dsl_gemm_fp4_runner.


2939-2950: Hoist get_device_properties call outside the loop.

torch.cuda.get_device_properties(a.device).multi_processor_count is called inside nested loops for each use_prefetch=True candidate. Move it before the loop to avoid repeated lookups.

♻️ Suggested change

Add before the for mma_tiler_mn loop (around line 2905):

sm_count = torch.cuda.get_device_properties(a.device).multi_processor_count

Then replace lines 2945-2947:

-                                sm_count = torch.cuda.get_device_properties(
-                                    a.device
-                                ).multi_processor_count

2808-2821: Noted: SM103 kernel disabled with clear TODO.

The commented-out SM103 import with the explanatory TODO and the explicit Sm103Kernel = None sentinel is clear. Consider tracking this with a GitHub issue so it doesn't get lost.

Would you like me to open an issue to track re-enabling the SM103 kernel once the cutlass-dsl package supports SM103MmaMXF4Op?

flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py (3)

1672-1678: tidx parameter is unused.

tidx is accepted but never referenced in the method body. If it's kept for API consistency with sibling epilog_* methods, consider prefixing with underscore (_tidx) to signal intent.


1918-1942: c_dtype and c_major parameters are unused.

These are accepted but never referenced in the validation logic. If they're placeholders for future constraints, consider adding a brief comment or prefixing with underscore.


1968-1968: Lambda assigned to a variable — prefer a def (Ruff E731).

Proposed fix
-        _is_power_of_2 = lambda x: x > 0 and (x & (x - 1)) == 0
+        def _is_power_of_2(x):
+            return x > 0 and (x & (x - 1)) == 0

Comment on lines +3392 to +3397
enable_pdl: bool
    Whether to enable Programmatic Dependent Launch (PDL) for the ``cute_dsl``
    backend, defaults to ``True``. PDL allows overlapping the tail of one kernel
    with the start of the next for reduced launch latency. This parameter is
    only used by the ``cute_dsl`` backend and is ignored by other backends.

Contributor

⚠️ Potential issue | 🟡 Minor

Minor inconsistency: cute_dsl vs cute-dsl naming in docstring.

Line 3393 references cute_dsl (underscore) while the backend literal uses "cute-dsl" (hyphen). Consider using backticks with the exact backend string "cute-dsl" consistently to avoid user confusion.

📝 Suggested fix
     enable_pdl: bool
-        Whether to enable Programmatic Dependent Launch (PDL) for the ``cute_dsl``
+        Whether to enable Programmatic Dependent Launch (PDL) for the ``"cute-dsl"``
         backend, defaults to ``True``. PDL allows overlapping the tail of one kernel
         with the start of the next for reduced launch latency. This parameter is
-        only used by the ``cute_dsl`` backend and is ignored by other backends.
+        only used by the ``"cute-dsl"`` backend and is ignored by other backends.

Also on line 3402:

-    When cute_dsl backend is used, both a and b should be quantized with nvfp4_quantize using the 128x4 scale factor layout and do_shuffle=False (same as cudnn/cutlass).
+    When ``"cute-dsl"`` backend is used, both a and b should be quantized with nvfp4_quantize using the 128x4 scale factor layout and do_shuffle=False (same as cudnn/cutlass).
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
enable_pdl: bool
Whether to enable Programmatic Dependent Launch (PDL) for the ``cute_dsl``
backend, defaults to ``True``. PDL allows overlapping the tail of one kernel
with the start of the next for reduced launch latency. This parameter is
only used by the ``cute_dsl`` backend and is ignored by other backends.
enable_pdl: bool
Whether to enable Programmatic Dependent Launch (PDL) for the ``"cute-dsl"``
backend, defaults to ``True``. PDL allows overlapping the tail of one kernel
with the start of the next for reduced launch latency. This parameter is
only used by the ``"cute-dsl"`` backend and is ignored by other backends.
🤖 Prompt for AI Agents
In `@flashinfer/gemm/gemm_base.py` around lines 3392 - 3397, Docstring
inconsistency: replace the underscore form `cute_dsl` with the exact backend
literal `"cute-dsl"` in backticks wherever it appears in the docstring for the
enable_pdl parameter (and the other occurrence noted around line 3402) so the
documentation matches the actual backend name; update the text referencing
enable_pdl to read `\"cute-dsl\"` (in backticks) to ensure consistent naming
across the docstring for the enable_pdl parameter and related descriptive lines.

Comment on lines +1807 to +1824
# Calculate A/B/SFA/SFB stages:
# Start with total smem per CTA (capacity / occupancy)
# Subtract reserved bytes and initial C stages bytes
# Divide remaining by bytes needed per A/B/SFA/SFB stage
num_ab_stage = (
    smem_capacity // occupancy - (mbar_helpers_bytes + c_bytes)
) // ab_bytes_per_stage

# Refine epilogue stages:
# Calculate remaining smem after allocating for A/B/SFA/SFB stages and reserved bytes
# Add remaining unused smem to epilogue
num_c_stage += (
    smem_capacity
    - occupancy * ab_bytes_per_stage * num_ab_stage
    - occupancy * (mbar_helpers_bytes + c_bytes)
) // (occupancy * c_bytes_per_stage)

return num_acc_stage, num_ab_stage, num_c_stage
Contributor

⚠️ Potential issue | 🟡 Minor

No lower-bound guard on computed stage counts.

If smem_capacity / occupancy is too small for the chosen tile configuration, num_ab_stage (Line 1812) could compute to ≤ 0, and the C-stage refinement (Line 1818) could reduce num_c_stage below the initial value of 2. Both would lead to invalid pipeline configurations at runtime.

Consider adding a minimum-stage assertion or early-return:

Proposed guard
         num_ab_stage = (
             smem_capacity // occupancy - (mbar_helpers_bytes + c_bytes)
         ) // ab_bytes_per_stage
+        assert num_ab_stage >= 2, (
+            f"Not enough shared memory for at least 2 A/B stages "
+            f"(got {num_ab_stage}). Consider reducing tile size or cluster shape."
+        )
 
         # Refine epilogue stages:
🤖 Prompt for AI Agents
In `@flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py` around lines 1807 -
1824, The computed stage counts can become non-positive; after calculating
num_ab_stage and refining num_c_stage (using smem_capacity, occupancy,
mbar_helpers_bytes, c_bytes, ab_bytes_per_stage, c_bytes_per_stage), clamp them
to safe minima (e.g., num_ab_stage = max(1, num_ab_stage) and num_c_stage =
max(2, num_c_stage)) or raise a clear exception if the tile configuration is
invalid; do this just before the return in the function that computes stages so
the pipeline never receives <=0 stages and include a short error message if you
choose to raise.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #43830201: 16/20 passed

-----
When the cudnn/cutlass backend is used, both a and b should be quantized with nvfp4_quantize using the 128x4 scale factor layout and do_shuffle=False.
When trtllm backend is used, b must be quantized with 128x4 layout and `do_shuffle=True`. a can be quantized with either 128x4 or 8x4 layout (controlled by `use_8x4_sf_layout`) and `do_shuffle=False`.
When cute_dsl backend is used, both a and b should be quantized with nvfp4_quantize using the 128x4 scale factor layout and do_shuffle=False (same as cudnn/cutlass).
Contributor

Hi @nv-yunzheq, I have a quick question: if the scale layout is the same, is there any reason for not using this for the auto-tuner when backend=auto? Or should it still be considered experimental for now?

Collaborator

@bkryu bkryu Feb 13, 2026


Hi @b8zhong, yes, it is meant to be experimental at least for this PR; there is nothing inherently blocking the cute-dsl backend from being considered for autotuning when backend=auto. We'd like to get the kernel in first, knowing that it has satisfactory perf, and then perform a more detailed perf study.

We may include the cute-dsl backend as a config that can be autotuned in a followup PR.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@flashinfer/gemm/gemm_base.py`:
- Around line 2957-2962: The new parameters backend, use_nvfp4, and enable_pdl
are triggering Ruff ARG001/ARG002 unused-argument warnings; to silence them,
explicitly consume or acknowledge the arguments in each function that declares
them (e.g., add a line like "_ = backend, use_nvfp4, enable_pdl" near the top of
the function) or append a per-function noqa (e.g., "# noqa: ARG001") to the def
line; apply this change consistently for every function that declares these
parameters (the overloads/definitions using backend, use_nvfp4, enable_pdl in
this file).

---

Duplicate comments:
In `@flashinfer/gemm/gemm_base.py`:
- Around line 3564-3567: The non-contiguous view created by launch_out = out.T
when swap_ab is True can break kernel assumptions; change this to produce a
contiguous tensor (e.g., launch_out = out.T.contiguous() or launch_out =
out.transpose(0,1).contiguous()) so launch_out is contiguous in memory before
passing to the kernel; update the swap_ab branch where launch_out, swap_ab, and
out.T are used to ensure the contiguous output is supplied.
- Around line 3444-3455: The cache key tuple named cache_key (constructed from
sf_vec_size, mma_tiler_mn, cluster_shape_mn, swap_ab, use_prefetch, kernel_type,
use_tma_store, enable_pdl, out_dtype) is missing any device identity and can
incorrectly reuse kernels across GPUs; update the cache_key to include a
device-unique identifier (e.g., the CUDA device ordinal or a stable device
identifier such as PCI bus id / device UUID or
torch.cuda.get_device_properties(device).name+index) so compiled kernels are
cached per-device. Ensure you retrieve the current device from the same context
where kernels are compiled and append that identifier to the cache_key tuple.

Comment on lines +2957 to 2962
    backend: Literal[
        "cudnn", "trtllm", "cutlass", "cute-dsl", "auto"
    ] = "auto",  # unused
    use_nvfp4: bool = True,
    enable_pdl: bool = True,  # unused
):
Contributor

⚠️ Potential issue | 🟡 Minor

Silence Ruff unused-argument warnings for new backend/PDL parameters.

Ruff flags these as unused (ARG001/ARG002). If lint is enforced, consider explicitly consuming them (e.g., _ = backend, enable_pdl) or adding a # noqa: ARG001 on the def line.

💡 Example pattern
 def _check_mm_fp4_problem_size(..., backend=..., use_nvfp4=True, enable_pdl=True):
+    _ = backend, enable_pdl

Also applies to: 3017-3022, 3081-3086, 3108-3113, 3121-3153, 3612-3614

🧰 Tools
🪛 Ruff (0.15.1)

[warning] 2957-2957: Unused function argument: backend

(ARG001)


[warning] 2961-2961: Unused function argument: enable_pdl

(ARG001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gemm/gemm_base.py` around lines 2957 - 2962, The new parameters
backend, use_nvfp4, and enable_pdl are triggering Ruff ARG001/ARG002
unused-argument warnings; to silence them, explicitly consume or acknowledge the
arguments in each function that declares them (e.g., add a line like "_ =
backend, use_nvfp4, enable_pdl" near the top of the function) or append a
per-function noqa (e.g., "# noqa: ARG001") to the def line; apply this change
consistently for every function that declares these parameters (the
overloads/definitions using backend, use_nvfp4, enable_pdl in this file).

@nv-yunzheq nv-yunzheq enabled auto-merge (squash) February 19, 2026 17:53
Contributor

@coderabbitai coderabbitai Bot left a comment


🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@flashinfer/gemm/gemm_base.py`:
- Around line 3818-3829: The docstring uses the backend name cute_dsl
inconsistently with the project's canonical backend id "cute-dsl"; update the
occurrences in this docstring (the Notes block and the enable_pdl description)
to use "cute-dsl" (including quotes where other backend names are quoted) so the
naming matches other docs and earlier comments referencing the cute-dsl backend
and the enable_pdl parameter.
- Around line 3458-3576: The cache key built for _CUTE_DSL_MM_FP4_KERNEL_CACHE
(variable cache_key) is missing device identity, causing cross-device kernel
reuse; update the cache_key creation in gemm_base.py to include the current CUDA
device identifier (e.g., torch.cuda.current_device() or equivalent from the
cute/runtime/stream) so compiled_gemm and max_active_clusters are cached
per-device; ensure the same device id is used when looking up and storing
entries in _CUTE_DSL_MM_FP4_KERNEL_CACHE (refer to symbols cache_key,
_CUTE_DSL_MM_FP4_KERNEL_CACHE, compiled_gemm, max_active_clusters).
- Around line 3578-3583: The swap_ab branch assigns a non-contiguous view via
out.T to launch_out which can break downstream kernels; instead ensure
launch_out is a contiguous transposed tensor by replacing the out.T usage with
an explicit transpose followed by making it contiguous (e.g., use
out.transpose(...).contiguous() or out.t().contiguous()) so that launch_out is
contiguous when swap_ab is true; update the block that sets launch_out (the
swap_ab conditional around launch_out and out) accordingly.
- Around line 2961-3151: The Ruff warnings come from unused parameters (backend,
enable_pdl, and similar) introduced in the FP4 requirement helpers; to silence
them, explicitly mark those parameters as deliberately unused by either renaming
to a leading-underscore variant or adding a single-line discard (e.g., del
backend, enable_pdl) at the top of each affected function; apply this change in
_check_mm_fp4_problem_size, _cudnn_gemm_fp4_requirement,
_trtllm_gemm_fp4_requirement, _cutlass_gemm_fp4_requirement, and
_cute_dsl_gemm_fp4_requirement so Ruff no longer reports unused-argument
warnings while keeping the API unchanged.

@nv-yunzheq
Collaborator Author

@flashinfer-bot rerun failed

@yongwww
Member

yongwww commented Feb 19, 2026

@flashinfer-bot stop

@yongwww
Member

yongwww commented Feb 19, 2026

@flashinfer-bot rerun failed

@yongwww
Member

yongwww commented Feb 19, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !311 has been updated with latest changes, and the CI pipeline #44404621 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #44404621: 9/20 passed

"cudnn", "trtllm", "cutlass", "cute-dsl", "auto"
] = "auto", # unused
use_nvfp4: bool = True,
enable_pdl: bool = True, # unused
Collaborator

@dhiraj113 dhiraj113 Feb 21, 2026


Is this unused? If so, why has it been added?

Collaborator Author


This is the function that checks whether the given operation is runnable. It has to take exactly the same arguments as the mm_fp4 function itself; however, some of those parameters are not used in this support-check function.

"cudnn", "trtllm", "cutlass", "cute-dsl", "auto"
] = "auto", # unused
use_nvfp4: bool = True,
enable_pdl: bool = True, # unused
Collaborator


Why are all these arguments marked as # unused?

Collaborator Author


As mentioned above, it needs to have the same function signature as mm_fp4. However, when checking whether the cute_dsl backend is viable, we don't need any of these input parameters to determine if it's runnable; we only check whether cute_dsl is installed.
The # unused marker is for the pre-commit check: pre-commit would reject a function with unused parameters, so we mark them to suppress that behavior.
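
For readers unfamiliar with the pattern, a hedged sketch of what such a support-check helper can look like; the function name, parameters, and shape check are illustrative stand-ins, not the exact code in gemm_base.py:

from typing import Literal

def _example_backend_requirement(
    a_shape: tuple[int, int],
    b_shape: tuple[int, int],
    backend: Literal["cudnn", "trtllm", "cutlass", "cute-dsl", "auto"] = "auto",
    use_nvfp4: bool = True,
    enable_pdl: bool = True,
) -> bool:
    # The checker mirrors mm_fp4's signature so the dispatcher can call every
    # backend's checker uniformly, even though only a subset of arguments matters.
    _ = backend, use_nvfp4, enable_pdl  # deliberately unused; keeps linters quiet
    m, k = a_shape
    k2, n = b_shape
    return k == k2 and k % 2 == 0  # illustrative shape check only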

@nv-yunzheq nv-yunzheq merged commit 04c1b7b into flashinfer-ai:main Feb 21, 2026
38 of 44 checks passed
@nv-yunzheq nv-yunzheq deleted the cute_dsl_mmfp4 branch March 2, 2026 19:53
@coderabbitai coderabbitai Bot mentioned this pull request Mar 12, 2026
5 tasks